History log of /openbsd-current/sys/sys/event.h
Revision Date Author Comments
# 1.71 20-Aug-2023 visa

Add kqueue1() system call

kqueue1() takes the flags argument. This lets the kqueue file descriptor
be opened with O_CLOEXEC. Adapted from NetBSD.

OK guenther@


# 1.70 13-Aug-2023 visa

kevent: Add precision and abstimer flags for EVFILT_TIMER

Add timer precision flags NOTE_SECONDS, NOTE_MSECONDS, NOTE_USECONDS
and NOTE_NSECONDS for EVFILT_TIMER. Also, add an initial implementation
of NOTE_ABSTIME timers.

Similar kevent(2) flags exist on FreeBSD, NetBSD and XNU.

Initial diff by and OK aisha@
OK mpi@
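
As a rough illustration of the precision flags in 1.70, a timer filter has to normalize the user-supplied value into one internal unit. The sketch below is not the kernel's code: the `X_NOTE_*` values and the helper `timer_to_nsec()` are hypothetical names invented here, and the millisecond default for a missing flag is an assumption based on the traditional EVFILT_TIMER behavior.

```c
#include <stdint.h>

/* Placeholder flag values for illustration only; the real
 * definitions live in <sys/event.h> and may differ. */
#define X_NOTE_SECONDS		0x1u
#define X_NOTE_MSECONDS		0x2u
#define X_NOTE_USECONDS		0x4u
#define X_NOTE_NSECONDS		0x8u
#define X_NOTE_PRECISION_MASK	0xfu

/*
 * Convert a timer value to nanoseconds according to its precision flag.
 * Returns 0 on success, -1 if several precision flags are combined,
 * the value is negative, or the result would overflow.
 */
int
timer_to_nsec(uint32_t fflags, int64_t data, int64_t *nsec)
{
	int64_t scale;

	switch (fflags & X_NOTE_PRECISION_MASK) {
	case 0:			/* assumed default: milliseconds */
	case X_NOTE_MSECONDS:
		scale = 1000000;
		break;
	case X_NOTE_SECONDS:
		scale = 1000000000;
		break;
	case X_NOTE_USECONDS:
		scale = 1000;
		break;
	case X_NOTE_NSECONDS:
		scale = 1;
		break;
	default:		/* more than one precision flag set */
		return -1;
	}
	if (data < 0 || data > INT64_MAX / scale)
		return -1;
	*nsec = data * scale;
	return 0;
}
```

Rejecting combined flags keeps the conversion unambiguous; whether the kernel does the same is not stated in the commit message.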


Revision tags: OPENBSD_7_3_BASE
# 1.69 10-Feb-2023 visa

Adjust knote(9) API

Make knote(9) lock the knote list internally, and add knote_locked(9)
for the typical situation where the list is already locked.

Remove the KNOTE(9) macro to simplify the API.

Manual page OK jmc@
OK mpi@ mvs@


# 1.68 02-Feb-2023 mvs

Move the rest of common socket initialization within soalloc().

ok visa@


Revision tags: OPENBSD_7_1_BASE OPENBSD_7_2_BASE
# 1.67 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.66 13-Feb-2022 visa

Add helper functions for f_modify and f_process to condense code

These new functions, knote_modify() and knote_process(), implement
the logic that is common to most f_modify and f_process instances.

The code is inlined so as to not add yet another call frame on the
already towering stack of kqueue functions. Also, the _fn versions
allow direct calling of an event function when there is only one
filter type to handle.


# 1.65 13-Feb-2022 visa

Rename knote_modify() to knote_assign()

This avoids verb overlap with f_modify.


# 1.64 11-Feb-2022 visa

Check klist emptiness instead of NULL pointer in KNOTE()

All callers of KNOTE() supply a non-NULL klist argument. Replace the
NULL pointer check with klist emptiness check as a small optimization.

OK mpi@


# 1.63 11-Feb-2022 visa

Inline klist_empty() for more economic machine code.

OK mpi@


# 1.62 08-Feb-2022 visa

poll(2): Switch to kqueue backend

Implement the poll(2) system call on top of the kqueue subsystem.
This obsoletes the old, non-MP-safe poll backend.

On entering poll(2), the new code translates each pollfd array entry
into a set of knotes. When these knotes receive events through kqueue,
the events are translated back to pollfd format.

Entries in the pollfd array can refer to the same file descriptor with
overlapping event masks. To allow such overlap with knotes, use an extra
kn_pollid key that separates knotes of different pollfd entries.

Adapted from DragonFly BSD, initial implementation by mpi@.

Tested in snaps for three weeks.

OK mpi@
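
The role of the extra key in 1.62 can be pictured with a toy lookup. This is a simplified model, not the kernel's data structure: `struct xknote`, `xknote_find()` and the flat table are hypothetical, and the sketch assumes the poll id identifies the pollfd array entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced knote: only the fields that form the lookup key. */
struct xknote {
	int	fd;		/* monitored file descriptor */
	short	filter;		/* event filter, e.g. "readable" */
	int	pollid;		/* which pollfd entry owns this knote */
	bool	in_use;
};

/*
 * Find the knote for (fd, filter, pollid). Without pollid in the key,
 * two pollfd entries that watch the same fd with the same filter would
 * map to a single knote and could not be reported separately.
 */
struct xknote *
xknote_find(struct xknote *tab, size_t n, int fd, short filter, int pollid)
{
	for (size_t i = 0; i < n; i++) {
		if (tab[i].in_use && tab[i].fd == fd &&
		    tab[i].filter == filter && tab[i].pollid == pollid)
			return &tab[i];
	}
	return NULL;
}
```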


# 1.61 11-Dec-2021 visa

Clarify usage of __EV_POLL and __EV_SELECT

Make __EV_POLL specific to kqueue-based poll(2), to remove overlap
with __EV_SELECT that only select(2) uses.

OK millert@ mpi@


# 1.60 08-Dec-2021 visa

Fix select(2) exceptfds handling of FIFOs and pipes

Prevent select(2) from indicating an exceptional condition when the
other end of a FIFO or pipe is closed.

Originally, select(2) returned an exceptfds event only with a pty or
socket that has out-of-band data pending. millert@ says that OpenBSD
diverged from this by accident when poll(2) and select(2) were changed
to use the same backend code in year 2003.

OK millert@


# 1.59 29-Nov-2021 visa

kqueue: Revise badfd knote handling

When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.

The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.

The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.

As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.

The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.

Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.

OK mpi@


# 1.58 12-Nov-2021 visa

Keep knotes between poll/select system calls

Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.

To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.

This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.

Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.

Tested by anton@ and mpi@
OK mpi@
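
The serial number scheme of 1.58 can be modeled in a few lines. The sketch below is an assumption-laden simplification: `struct pollscan`, `pollscan_begin()` and `serial_is_current()` are hypothetical names, and the real code tracks serials per knote inside kqueue_register() and kqueue_scan().

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical per-thread scan state. */
struct pollscan {
	unsigned long	next;	/* first serial not yet handed out */
	unsigned long	base;	/* first serial of the current scan */
};

/*
 * Reserve a fresh range of n serials for a new poll/select iteration.
 * Returns true when the counter was about to wrap; in that case the
 * caller is expected to drop all retained knotes before continuing.
 */
bool
pollscan_begin(struct pollscan *ps, unsigned long n)
{
	bool flushed = false;

	if (n > ULONG_MAX - ps->next) {
		ps->next = 0;
		flushed = true;
	}
	ps->base = ps->next;
	ps->next += n;
	return flushed;
}

/* An event is delivered only if its knote's serial is in the current range. */
bool
serial_is_current(const struct pollscan *ps, unsigned long serial)
{
	return serial >= ps->base && serial < ps->next;
}
```

A knote left over from an earlier iteration keeps its old serial unless it is registered again, so it fails the range check and can be dropped lazily.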


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@
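
The idea of attaching lock operations to a klist might look roughly like this. Everything here is a hypothetical model: the `x`-prefixed types, the field names, and the counter standing in for the kernel lock are assumptions made for illustration, not the definitions in the kernel headers.

```c
#include <stddef.h>

/* Hypothetical callback descriptor: how to lock and unlock a list. */
struct xklistops {
	void	(*klo_lock)(void *);
	void	(*klo_unlock)(void *);
};

struct xklist {
	const struct xklistops	*kl_ops;
	void			*kl_arg;	/* passed to the callbacks */
	int			 kl_count;	/* stand-in for the knote list */
};

/* Stand-in for the kernel lock used when no specific ops are given. */
static int big_lock_calls;
static void big_lock(void *arg) { (void)arg; big_lock_calls++; }
static void big_unlock(void *arg) { (void)arg; }
static const struct xklistops fallback_ops = {
	.klo_lock = big_lock,
	.klo_unlock = big_unlock,
};

/* A monitored object that protects its state with its own lock. */
struct xobj { int locked; int lock_calls; };
static void xobj_lock(void *arg) { struct xobj *o = arg; o->locked = 1; o->lock_calls++; }
static void xobj_unlock(void *arg) { struct xobj *o = arg; o->locked = 0; }
static const struct xklistops xobj_klistops = {
	.klo_lock = xobj_lock,
	.klo_unlock = xobj_unlock,
};

void
xklist_init(struct xklist *kl, const struct xklistops *ops, void *arg)
{
	kl->kl_ops = (ops != NULL) ? ops : &fallback_ops;
	kl->kl_arg = arg;
	kl->kl_count = 0;
}

/* Caller already holds the list's lock. */
void
xklist_insert_locked(struct xklist *kl)
{
	kl->kl_count++;
}

/* Takes the list's lock through the callbacks, whatever its type is. */
void
xklist_insert(struct xklist *kl)
{
	kl->kl_ops->klo_lock(kl->kl_arg);
	xklist_insert_locked(kl);
	kl->kl_ops->klo_unlock(kl->kl_arg);
}
```

Because the list only sees two function pointers and an opaque argument, the object can reuse the lock that already protects its state, be it a mutex, an rwlock, or something else.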


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@
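
Why the end marker must survive between calls can be seen in a small model of the active queue. This is a sketch under simplifying assumptions, not the kernel's implementation: the ring buffer, `END_MARKER`, and the `evq_*` functions are hypothetical, and reactivation is simulated by re-queueing every collected event.

```c
#include <stddef.h>

#define QCAP		16
#define END_MARKER	(-1)

/* Hypothetical active-event queue holding event ids and one marker. */
struct evqueue {
	int	items[QCAP];
	size_t	head;
	size_t	len;
};

static int
evq_push(struct evqueue *q, int v)
{
	if (q->len == QCAP)
		return -1;
	q->items[(q->head + q->len++) % QCAP] = v;
	return 0;
}

static int
evq_pop(struct evqueue *q)
{
	int v = q->items[q->head];

	q->head = (q->head + 1) % QCAP;
	q->len--;
	return v;
}

/* Start a scan: everything queued before the marker belongs to it. */
int
evq_scan_begin(struct evqueue *q)
{
	return evq_push(q, END_MARKER);
}

/*
 * Collect up to max events. With reactivate set, each collected event
 * fires again and is appended at the tail, i.e. behind the marker.
 * *done is set once the marker has been reached and removed.
 */
size_t
evq_scan_step(struct evqueue *q, int *out, size_t max, int reactivate,
    int *done)
{
	size_t n = 0;

	*done = 0;
	while (n < max && q->len > 0) {
		int ev = evq_pop(q);

		if (ev == END_MARKER) {
			*done = 1;
			break;
		}
		out[n++] = ev;
		if (reactivate)
			(void)evq_push(q, ev);	/* cannot fail: a slot was just freed */
	}
	return n;
}
```

With events 1, 2, 3 queued, two calls with `max` set to 2 return {1, 2} and then {3}; the re-queued copies stay behind the marker and wait for the next scan.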


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@
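
The shadowing problem and its fix can be reproduced with a cut-down macro. `struct xevent`, `XEV_SET()` and `fill_two()` are hypothetical stand-ins for illustration; only the naming and copy-the-argument pattern is taken from the commit messages.

```c
/* Hypothetical two-field event record. */
struct xevent {
	unsigned long	ident;
	short		filter;
};

/*
 * The macro copies its first argument into a local whose name starts
 * with two underscores. Had the local been named plainly, e.g. "evp",
 * a caller passing its own variable "evp" would get
 *     struct xevent *evp = (evp);
 * which initializes the new variable from itself. The copy also makes
 * sure the argument is evaluated only once.
 */
#define XEV_SET(evp, a, b) do {			\
	struct xevent *__evp = (evp);		\
	__evp->ident = (a);			\
	__evp->filter = (b);			\
} while (0)

void
fill_two(struct xevent *arr)
{
	struct xevent *evp = arr;	/* same name as the macro parameter */

	XEV_SET(evp++, 7, 1);		/* evp advances exactly once */
	XEV_SET(evp, 9, 2);
}
```

Double-underscore identifiers are reserved for the implementation, which is why the convention suits a system header and should not be copied into application macros.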


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@
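
A filter table written with designated initializers might look like the following. The struct layout, the `xf_` member names, `XFILTEROP_ISFD` and the demo callbacks are hypothetical; the sketch only shows the initialization style, not the actual struct filterops.

```c
#include <stddef.h>

#define XFILTEROP_ISFD	0x1	/* hypothetical property flag */

struct xknote {
	long	data;
};

/* Hypothetical, reduced filter operations table. */
struct xfilterops {
	int	 xf_flags;
	int	(*xf_attach)(struct xknote *);
	void	(*xf_detach)(struct xknote *);
	int	(*xf_event)(struct xknote *, long);
};

static void
demo_detach(struct xknote *kn)
{
	kn->data = 0;
}

static int
demo_event(struct xknote *kn, long hint)
{
	kn->data = hint;
	return hint != 0;
}

/*
 * Members are named, so their order and any later additions to the
 * struct do not matter. Being const, the table can be placed in
 * read-only data.
 */
const struct xfilterops demo_filtops = {
	.xf_flags	= XFILTEROP_ISFD,
	.xf_attach	= NULL,
	.xf_detach	= demo_detach,
	.xf_event	= demo_event,
};
```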


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.
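
Both macro fixes from 1.13 and 1.12 can be shown on a reduced example. `struct xlist`, `xnotify()`, `XKNOTE()` and `demo()` are hypothetical names; the macro body only mimics the shape described in the two commit messages.

```c
#include <stddef.h>

/* Hypothetical notification list. */
struct xlist {
	long	last_hint;
};

static void
xnotify(struct xlist *l, long hint)
{
	l->last_hint = hint;
}

/*
 * - Arguments are wrapped in () so that expressions such as
 *   "c ? a : b" or "1 | 2" expand as a unit.
 * - The if statement is wrapped in do { } while (0) so the macro
 *   behaves like one statement. Without the wrapper, the "else" in
 *   demo() would attach to the macro's hidden "if".
 */
#define XKNOTE(list, hint) do {				\
	if ((list) != NULL)				\
		xnotify((list), (hint));		\
} while (0)

/* Returns 1 when the else branch ran, 0 otherwise. */
int
demo(struct xlist *a, struct xlist *b, int use_a)
{
	int took_else = 0;

	if (use_a)
		XKNOTE(use_a ? a : b, 1 | 2);
	else
		took_else = 1;
	return took_else;
}
```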

"yeah" deraadt@


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@
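
The idea of attaching lock operations to a klist can be sketched in userland as follows. All names here (`struct toy_klistops`, `toy_klist_insert()`, the flag-based `struct toy_lock`) are hypothetical and chosen for illustration; the kernel's actual structures and locks differ. The sketch shows a callback descriptor plus argument, a fallback to one global lock when no descriptor is set, and an insert helper that can assert the lock is held.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lock operations descriptor; names are illustrative. */
struct toy_klistops {
	void	(*klo_lock)(void *);
	void	(*klo_unlock)(void *);
	int	(*klo_islocked)(void *);
};

struct toy_klist {
	const struct toy_klistops *kl_ops;	/* NULL: use the global lock */
	void	*kl_arg;			/* the monitored object's lock */
	int	 kl_count;			/* number of attached notes */
};

/* A toy lock: just a flag, enough to show the call pattern. */
struct toy_lock {
	int	held;
};

void toy_lock_enter(void *arg) { ((struct toy_lock *)arg)->held = 1; }
void toy_lock_leave(void *arg) { ((struct toy_lock *)arg)->held = 0; }
int  toy_lock_held(void *arg) { return (((struct toy_lock *)arg)->held); }

const struct toy_klistops toy_lock_klistops = {
	.klo_lock = toy_lock_enter,
	.klo_unlock = toy_lock_leave,
	.klo_islocked = toy_lock_held,
};

struct toy_lock toy_global_lock;	/* models the kernel-lock fallback */

void
toy_klist_lock(struct toy_klist *kl)
{
	if (kl->kl_ops != NULL)
		kl->kl_ops->klo_lock(kl->kl_arg);
	else
		toy_lock_enter(&toy_global_lock);
}

void
toy_klist_unlock(struct toy_klist *kl)
{
	if (kl->kl_ops != NULL)
		kl->kl_ops->klo_unlock(kl->kl_arg);
	else
		toy_lock_leave(&toy_global_lock);
}

int
toy_klist_islocked(struct toy_klist *kl)
{
	if (kl->kl_ops != NULL)
		return (kl->kl_ops->klo_islocked(kl->kl_arg));
	return (toy_lock_held(&toy_global_lock));
}

/* The subsystem can now assert the lock without knowing its type. */
void
toy_klist_insert_locked(struct toy_klist *kl)
{
	assert(toy_klist_islocked(kl));
	kl->kl_count++;
}

void
toy_klist_insert(struct toy_klist *kl)
{
	toy_klist_lock(kl);
	toy_klist_insert_locked(kl);
	toy_klist_unlock(kl);
}
```

A list with `kl_ops` left NULL still works, which mirrors the compatibility fallback described in the commit message.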


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@
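
The role of the preserved end marker can be illustrated with a toy queue. This is a sketch, not the kernel code: `struct toy_scan_state`, `toy_scan()` and the singly linked queue are hypothetical, and every collected event is re-queued to model the worst case of constant reactivation. Because the marker stays in the queue between calls, re-queued events land behind it and are not collected twice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue model; not the kernel's kqueue structures. */
struct toy_node {
	struct toy_node	*next;
	int		 id;		/* event id, unused for the marker */
};

struct toy_queue {
	struct toy_node	*head, *tail;
};

struct toy_scan_state {
	struct toy_queue *q;
	struct toy_node	 end;		/* end marker, kept between calls */
	int		 queued;	/* nonzero once the marker is queued */
};

void
toy_enqueue(struct toy_queue *q, struct toy_node *n)
{
	n->next = NULL;
	if (q->tail != NULL)
		q->tail->next = n;
	else
		q->head = n;
	q->tail = n;
}

struct toy_node *
toy_dequeue(struct toy_queue *q)
{
	struct toy_node *n = q->head;

	if (n != NULL) {
		q->head = n->next;
		if (q->head == NULL)
			q->tail = NULL;
	}
	return (n);
}

void
toy_scan_setup(struct toy_scan_state *s, struct toy_queue *q)
{
	s->q = q;
	s->end.next = NULL;
	s->end.id = -1;
	s->queued = 0;
}

/* Collect up to max events; may be called repeatedly for one scan. */
int
toy_scan(struct toy_scan_state *s, int *out, int max)
{
	int n = 0;

	if (!s->queued) {
		toy_enqueue(s->q, &s->end);
		s->queued = 1;
	}
	while (n < max && s->q->head != &s->end) {
		struct toy_node *ev = toy_dequeue(s->q);

		out[n++] = ev->id;
		/* Model reactivation: back to the tail, past the marker. */
		toy_enqueue(s->q, ev);
	}
	return (n);
}

/* Usage: three events collected in two calls, none of them twice. */
void
toy_scan_demo(void)
{
	struct toy_queue q = { NULL, NULL };
	struct toy_node ev[3] = { { NULL, 1 }, { NULL, 2 }, { NULL, 3 } };
	struct toy_scan_state s;
	int out[2], i;

	for (i = 0; i < 3; i++)
		toy_enqueue(&q, &ev[i]);
	toy_scan_setup(&s, &q);
	assert(toy_scan(&s, out, 2) == 2 && out[0] == 1 && out[1] == 2);
	assert(toy_scan(&s, out, 2) == 1 && out[0] == 3);
	assert(toy_scan(&s, out, 2) == 0);	/* marker reached */
}
```

A complete implementation would also remove the marker when the scan ends; that step is left out here to keep the sketch short.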


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and Dragonfly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9).

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@
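
The naming problem can be shown with a cut-down macro. This is a sketch under assumptions: `struct toy_kevent` and `TOY_EV_SET()` are hypothetical stand-ins with fewer fields than the real interface, and `kevp_` is only an example of a plausible caller variable name. Note that double-underscore identifiers are reserved for the implementation in C, which is why a system header may use them while application code should not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down event record; not the real struct kevent. */
struct toy_kevent {
	uintptr_t	ident;
	short		filter;
	unsigned short	flags;
};

/*
 * The macro copies its first argument into a local pointer. If that
 * local were called, say, kevp_ and the caller passed a variable of
 * the same name, the expansion would read
 *	struct toy_kevent *kevp_ = (kevp_);
 * which initializes the new pointer from itself. The reserved-style
 * name below cannot collide with a conforming caller's identifiers.
 */
#define TOY_EV_SET(kevp, a, b, c) do {			\
	struct toy_kevent *__kevp = (kevp);		\
	__kevp->ident = (a);				\
	__kevp->filter = (b);				\
	__kevp->flags = (c);				\
} while (0)

/* Usage: a caller variable named kevp_ is handled correctly. */
int
toy_ev_set_demo(void)
{
	struct toy_kevent ev = { 0, 0, 0 };
	struct toy_kevent *kevp_ = &ev;

	TOY_EV_SET(kevp_, 7, -1, 0x0001);
	assert(ev.ident == 7 && ev.filter == -1 && ev.flags == 0x0001);
	return ((int)ev.ident);
}
```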


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@
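
A flags word scales better than one member per property, as this small sketch suggests. The `TOY_FILTEROP_*` names and values are hypothetical and only illustrate the pattern; they are not taken from the header.

```c
#include <assert.h>

/* Hypothetical property bits; names and values are illustrative. */
#define TOY_FILTEROP_ISFD	0x0001	/* ident is a file descriptor */
#define TOY_FILTEROP_MPSAFE	0x0002	/* example of a later property */

struct toy_filterops {
	int	f_flags;	/* replaces a dedicated f_isfd member */
};

int
toy_filter_isfd(const struct toy_filterops *ops)
{
	return ((ops->f_flags & TOY_FILTEROP_ISFD) != 0);
}

int
toy_filter_mpsafe(const struct toy_filterops *ops)
{
	return ((ops->f_flags & TOY_FILTEROP_MPSAFE) != 0);
}

/* Usage: a new property needs a new bit, not a new struct member. */
void
toy_flags_demo(void)
{
	const struct toy_filterops fd_ops = { TOY_FILTEROP_ISFD };
	const struct toy_filterops both = {
		TOY_FILTEROP_ISFD | TOY_FILTEROP_MPSAFE
	};

	assert(toy_filter_isfd(&fd_ops) && !toy_filter_mpsafe(&fd_ops));
	assert(toy_filter_isfd(&both) && toy_filter_mpsafe(&both));
}
```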


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@
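
For illustration, a designated-initializer definition might look like the sketch below. The struct layout and the `toy_` names are hypothetical and simplified; only the initializer style and the `const` qualifier are the point. On typical toolchains a `const` object with static storage is placed in a read-only section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the kernel definitions. */
struct toy_knote {
	int	attached;
	long	data;
};

struct toy_filterops {
	int	f_isfd;
	int	(*f_attach)(struct toy_knote *);
	void	(*f_detach)(struct toy_knote *);
	int	(*f_event)(struct toy_knote *, long);
};

void
toy_detach(struct toy_knote *kn)
{
	kn->attached = 0;
}

int
toy_event(struct toy_knote *kn, long hint)
{
	kn->data = hint;
	return (hint != 0);
}

/*
 * Positional form, for comparison (depends on member order):
 *	struct toy_filterops ops = { 1, NULL, toy_detach, toy_event };
 *
 * Designated form: each member is named, and const lets the object
 * live in read-only data.
 */
const struct toy_filterops toy_read_filtops = {
	.f_isfd		= 1,
	.f_attach	= NULL,
	.f_detach	= toy_detach,
	.f_event	= toy_event,
};

/* Usage: callers go through the const table. */
void
toy_filtops_demo(void)
{
	struct toy_knote kn = { 1, 0 };

	assert(toy_read_filtops.f_event(&kn, 5) == 1 && kn.data == 5);
	toy_read_filtops.f_detach(&kn);
	assert(kn.attached == 0);
}
```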


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
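
The multiple-evaluation hazard is easy to reproduce with a reduced macro. The sketch below uses hypothetical names (`struct toy_kevent`, `TOY_SET_UNSAFE()`, `TOY_SET_SAFE()`) and only two fields; it is not the header's macro. It shows what happens when the first argument has a side effect.

```c
#include <assert.h>

/* Hypothetical two-field record; not the real struct kevent. */
struct toy_kevent {
	int	ident;
	int	filter;
};

/* Unsafe: each use of kevp in the body evaluates the argument again. */
#define TOY_SET_UNSAFE(kevp, a, b) do {		\
	(kevp)->ident = (a);			\
	(kevp)->filter = (b);			\
} while (0)

/* Safe: the argument is evaluated once and copied to a local. */
#define TOY_SET_SAFE(kevp, a, b) do {		\
	struct toy_kevent *__kevp = (kevp);	\
	__kevp->ident = (a);			\
	__kevp->filter = (b);			\
} while (0)

/* One use of the unsafe macro advances i twice and splits the write. */
int
toy_advance_unsafe(void)
{
	struct toy_kevent evs[2] = { { 0, 0 }, { 0, 0 } };
	int i = 0;

	TOY_SET_UNSAFE(&evs[i++], 100, 1);
	assert(evs[0].ident == 100 && evs[0].filter == 0);
	assert(evs[1].ident == 0 && evs[1].filter == 1);
	return (i);
}

/* One use of the safe macro advances i once and fills one entry. */
int
toy_advance_safe(void)
{
	struct toy_kevent evs[2] = { { 0, 0 }, { 0, 0 } };
	int i = 0;

	TOY_SET_SAFE(&evs[i++], 100, 1);
	assert(evs[0].ident == 100 && evs[0].filter == 1);
	assert(evs[1].ident == 0 && evs[1].filter == 0);
	return (i);
}
```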


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu
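
The invalidation scheme can be modelled in a few lines of userland C. Everything below is a hypothetical sketch: the `toy_` types are simplified, and a dedicated sentinel object stands in for the NODEV marker mentioned above. It shows the free path walking the attached knotes and the filter checking for the sentinel before it dereferences the hook.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXNOTES	4

struct toy_tty;

struct toy_knote {
	struct toy_tty	*kn_hook;	/* backing tty, or TOY_NODEV */
};

struct toy_tty {
	int		 t_nread;			/* bytes ready */
	struct toy_knote *t_notes[TOY_MAXNOTES];	/* attached knotes */
	int		 t_nnotes;
};

/* Sentinel standing in for the NODEV marker; an assumption of this sketch. */
struct toy_tty toy_dead_tty;
#define TOY_NODEV	(&toy_dead_tty)

int
toy_attach(struct toy_tty *tp, struct toy_knote *kn)
{
	if (tp->t_nnotes == TOY_MAXNOTES)
		return (-1);
	kn->kn_hook = tp;
	tp->t_notes[tp->t_nnotes++] = kn;
	return (0);
}

/* Called when the device goes away: detach and invalidate every knote. */
void
toy_ttyfree(struct toy_tty *tp)
{
	int i;

	for (i = 0; i < tp->t_nnotes; i++)
		tp->t_notes[i]->kn_hook = TOY_NODEV;
	tp->t_nnotes = 0;
}

/* Read filter: ignore knotes whose tty has been freed. */
int
toy_filt_read(struct toy_knote *kn)
{
	if (kn->kn_hook == TOY_NODEV)
		return (0);
	return (kn->kn_hook->t_nread > 0);
}

/* Usage: the filter stays safe to call after the tty is gone. */
void
toy_ttyfree_demo(void)
{
	struct toy_tty tty = { 3, { NULL }, 0 };
	struct toy_knote kn = { NULL };

	assert(toy_attach(&tty, &kn) == 0);
	assert(toy_filt_read(&kn) == 1);
	toy_ttyfree(&tty);
	assert(toy_filt_read(&kn) == 0);
}
```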


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@
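
A generic example of the "strange expansion" that parentheses prevent, using hypothetical macros unrelated to KNOTE itself:

```c
#include <assert.h>

/* Unparenthesized: the argument's operators mix with the macro's. */
#define TOY_SCALE_BAD(x)	x * 2
/* Parenthesized argument and result: the expansion stays one unit. */
#define TOY_SCALE(x)		((x) * 2)

/* Expands to a + b * 2, so only b is doubled. */
int
toy_scale_bad(int a, int b)
{
	return (TOY_SCALE_BAD(a + b));
}

/* Expands to ((a + b) * 2), as the caller intended. */
int
toy_scale(int a, int b)
{
	return (TOY_SCALE(a + b));
}

/* Usage: the same call site gives different results. */
void
toy_scale_demo(void)
{
	assert(toy_scale_bad(1, 2) == 5);
	assert(toy_scale(1, 2) == 6);
}
```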


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.
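
A sketch of why the wrapper matters, with a hypothetical `TOY_NOTE()` macro standing in for the real one:

```c
#include <assert.h>

/*
 * A bare-if version, shown only in this comment:
 *	#define TOY_NOTE_BAD(count, hint) if ((hint) != 0) (*(count))++
 * Used as the body of an if that has an else, the caller's else
 * would bind to the macro's inner if instead of the outer one.
 *
 * The do { } while (0) form below is a single statement that takes
 * the trailing semicolon, so it is safe in that position.
 */
#define TOY_NOTE(count, hint) do {		\
	if ((hint) != 0)			\
		(*(count))++;			\
} while (0)

/* Returns notes * 10 + fallback so both counters can be checked. */
int
toy_note_demo(int enabled, int hint)
{
	int notes = 0, fallback = 0;

	if (enabled)
		TOY_NOTE(&notes, hint);
	else
		fallback++;
	return (notes * 10 + fallback);
}

/* Usage: the else branch runs only when enabled is zero. */
void
toy_note_selfcheck(void)
{
	assert(toy_note_demo(1, 1) == 10);
	assert(toy_note_demo(1, 0) == 0);
	assert(toy_note_demo(0, 1) == 1);
}
```

With the bare-if variant, `toy_note_demo(1, 0)` would count a fallback and `toy_note_demo(0, 1)` would count nothing.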

"yeah" deraadt@


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.69 10-Feb-2023 visa

Adjust knote(9) API

Make knote(9) lock the knote list internally, and add knote_locked(9)
for the typical situation where the list is already locked.

Remove the KNOTE(9) macro to simplify the API.

Manual page OK jmc@
OK mpi@ mvs@


# 1.68 02-Feb-2023 mvs

Move the rest of common socket initialization within soalloc().

ok visa@


Revision tags: OPENBSD_7_1_BASE OPENBSD_7_2_BASE
# 1.67 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.66 13-Feb-2022 visa

Add helper functions for f_modify and f_process to condense code

These new functions, knote_modify() and knote_process(), implement
the logic that is common to most f_modify and f_process instances.

The code is inlined so as to not add yet another call frame on the
already towering stack of kqueue functions. Also, the _fn versions
allow direct calling of an event function when there is only one
filter type to handle.


# 1.65 13-Feb-2022 visa

Rename knote_modify() to knote_assign()

This avoids verb overlap with f_modify.


# 1.64 11-Feb-2022 visa

Check klist emptiness instead of NULL pointer in KNOTE()

All callers of KNOTE() supply a non-NULL klist argument. Replace the
NULL pointer check with klist emptiness check as a small optimization.

OK mpi@


# 1.63 11-Feb-2022 visa

Inline klist_empty() for more economic machine code.

OK mpi@


# 1.62 08-Feb-2022 visa

poll(2): Switch to kqueue backend

Implement the poll(2) system call on top of the kqueue subsystem.
This obsoletes the old, non-MP-safe poll backend.

On entering poll(2), the new code translates each pollfd array entry
into a set of knotes. When these knotes receive events through kqueue,
the events are translated back to pollfd format.

Entries in the pollfd array can refer to the same file descriptor with
overlapping event masks. To allow such overlap with knotes, use an extra
kn_pollid key that separates knotes of different pollfd entries.

Adapted from DragonFly BSD, initial implementation by mpi@.

Tested in snaps for three weeks.

OK mpi@


# 1.61 11-Dec-2021 visa

Clarify usage of __EV_POLL and __EV_SELECT

Make __EV_POLL specific to kqueue-based poll(2), to remove overlap
with __EV_SELECT that only select(2) uses.

OK millert@ mpi@


# 1.60 08-Dec-2021 visa

Fix select(2) exceptfds handling of FIFOs and pipes

Prevent select(2) from indicating an exceptional condition when the
other end of a FIFO or pipe is closed.

Originally, select(2) returned an exceptfds event only with a pty or
socket that has out-of-band data pending. millert@ says that OpenBSD
diverged from this by accident when poll(2) and select(2) were changed
to use the same backend code in year 2003.

OK millert@


# 1.59 29-Nov-2021 visa

kqueue: Revise badfd knote handling

When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.

The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.

The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.

As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.

The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.

Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.

OK mpi@


# 1.58 12-Nov-2021 visa

Keep knotes between poll/select systems calls

Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.

To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.

This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.

Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.

Tested by anton@ and mpi@
OK mpi@


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and Dragonfly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionnality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking the TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.67 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.66 13-Feb-2022 visa

Add helper functions for f_modify and f_process to condense code

These new functions, knote_modify() and knote_process(), implement
the logic that is common to most f_modify and f_process instances.

The code is inlined so as to not add yet another call frame on the
already towering stack of kqueue functions. Also, the _fn versions
allow direct calling of an event function when there is only one
filter type to handle.


# 1.65 13-Feb-2022 visa

Rename knote_modify() to knote_assign()

This avoids verb overlap with f_modify.


# 1.64 11-Feb-2022 visa

Check klist emptiness instead of NULL pointer in KNOTE()

All callers of KNOTE() supply a non-NULL klist argument. Replace the
NULL pointer check with klist emptiness check as a small optimization.

OK mpi@


# 1.63 11-Feb-2022 visa

Inline klist_empty() for more economic machine code.

OK mpi@


# 1.62 08-Feb-2022 visa

poll(2): Switch to kqueue backend

Implement the poll(2) system call on top of the kqueue subsystem.
This obsoletes the old, non-MP-safe poll backend.

On entering poll(2), the new code translates each pollfd array entry
into a set of knotes. When these knotes receive events through kqueue,
the events are translated back to pollfd format.

Entries in the pollfd array can refer to the same file descriptor with
overlapping event masks. To allow such overlap with knotes, use an extra
kn_pollid key that separates knotes of different pollfd entries.

Adapted from DragonFly BSD, initial implementation by mpi@.

Tested in snaps for three weeks.

OK mpi@


# 1.61 11-Dec-2021 visa

Clarify usage of __EV_POLL and __EV_SELECT

Make __EV_POLL specific to kqueue-based poll(2), to remove overlap
with __EV_SELECT that only select(2) uses.

OK millert@ mpi@


# 1.60 08-Dec-2021 visa

Fix select(2) exceptfds handling of FIFOs and pipes

Prevent select(2) from indicating an exceptional condition when the
other end of a FIFO or pipe is closed.

Originally, select(2) returned an exceptfds event only with a pty or
socket that has out-of-band data pending. millert@ says that OpenBSD
diverged from this by accident when poll(2) and select(2) were changed
to use the same backend code in year 2003.

OK millert@


# 1.59 29-Nov-2021 visa

kqueue: Revise badfd knote handling

When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.

The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.

The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.

As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.

The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.

Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.

OK mpi@


# 1.58 12-Nov-2021 visa

Keep knotes between poll/select system calls

Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.

To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.
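
The serial scheme described above can be modelled in userland roughly as follows. This is a minimal sketch, not the kernel code: struct pnote, struct pollset and the pollset_*() functions are hypothetical, expired notes are dropped eagerly during the scan, and serial wrap-around is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAXNOTES 8

struct pnote {
	int		fd;
	unsigned long	serial;
	bool		active;		/* an event is pending */
};

struct pollset {
	struct pnote	notes[MAXNOTES];
	size_t		nnotes;
	unsigned long	next_serial;	/* first serial not yet handed out */
};

/* Setup stage: refresh the serial of a retained note, or add a new note. */
static bool
pollset_register(struct pollset *ps, int fd)
{
	size_t i;

	for (i = 0; i < ps->nnotes; i++) {
		if (ps->notes[i].fd == fd) {
			ps->notes[i].serial = ps->next_serial++;
			return true;
		}
	}
	if (ps->nnotes == MAXNOTES)
		return false;
	ps->notes[ps->nnotes].fd = fd;
	ps->notes[ps->nnotes].serial = ps->next_serial++;
	ps->notes[ps->nnotes].active = false;
	ps->nnotes++;
	return true;
}

/* Mark the retained note of fd, if any, as having a pending event. */
static void
pollset_activate(struct pollset *ps, int fd)
{
	size_t i;

	for (i = 0; i < ps->nnotes; i++)
		if (ps->notes[i].fd == fd)
			ps->notes[i].active = true;
}

/*
 * Scan stage: report active notes whose serial is at least `first`, the
 * first serial of the current iteration. Older notes are expired and dropped.
 */
static size_t
pollset_scan(struct pollset *ps, unsigned long first, int *out, size_t max)
{
	size_t i = 0, n = 0;

	assert(ps->nnotes <= MAXNOTES);
	while (i < ps->nnotes) {
		struct pnote *pn = &ps->notes[i];

		if (pn->serial < first) {
			/* Expired: overwrite with the last note and retry. */
			*pn = ps->notes[--ps->nnotes];
			continue;
		}
		if (pn->active && n < max) {
			out[n++] = pn->fd;
			pn->active = false;
		}
		i++;
	}
	return n;
}
```

A caller would record next_serial before the setup stage of each iteration and pass that value as `first` to the scan, so that a note left over from an earlier iteration cannot report an event.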

This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.

Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.

Tested by anton@ and mpi@
OK mpi@


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.
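
As an illustration of associating lock operations with a list head, here is a toy userland model. All names (struct lockops, struct toy_klist, struct toy_lock and the toy_*() functions) are hypothetical and chosen for this sketch; the toy lock only counts acquisitions and stands in for a real mutex or rwlock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback descriptor: how to assert, take and drop a lock. */
struct lockops {
	void	(*lo_assertlk)(void *);
	void	(*lo_lock)(void *);
	void	(*lo_unlock)(void *);
};

/* Toy list head extended with the descriptor and its argument. */
struct toy_klist {
	int			 kl_count;	/* notes on the list */
	const struct lockops	*kl_ops;
	void			*kl_arg;	/* lock of the monitored object */
};

/* Toy non-blocking "lock" that records its state for the assertions. */
struct toy_lock {
	int	held;
	int	acquisitions;
};

static void
toy_assertlk(void *arg)
{
	assert(((struct toy_lock *)arg)->held);
}

static void
toy_lock_enter(void *arg)
{
	struct toy_lock *l = arg;

	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

static void
toy_lock_leave(void *arg)
{
	struct toy_lock *l = arg;

	assert(l->held);
	l->held = 0;
}

static const struct lockops toy_lockops = {
	.lo_assertlk	= toy_assertlk,
	.lo_lock	= toy_lock_enter,
	.lo_unlock	= toy_lock_leave,
};

/* The caller already holds the lock; the list code only asserts it. */
static void
toy_klist_insert_locked(struct toy_klist *kl)
{
	kl->kl_ops->lo_assertlk(kl->kl_arg);
	kl->kl_count++;
}

/* The list code takes and releases the lock through the descriptor. */
static void
toy_klist_insert(struct toy_klist *kl)
{
	kl->kl_ops->lo_lock(kl->kl_arg);
	toy_klist_insert_locked(kl);
	kl->kl_ops->lo_unlock(kl->kl_arg);
}
```

Because the list head only sees the descriptor and an opaque argument, the same list code would work with whatever lock type the monitored object already uses.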

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.
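
The effect of a persistent end marker can be sketched with a toy ring-buffer queue. This is a simplified single-scanner model under assumptions: struct evqueue, struct scan_state, evq_push(), evq_pop() and scan_step() are hypothetical names, and MARKER is an arbitrary sentinel value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP	16
#define MARKER	(-1)	/* sentinel standing in for the end marker */

struct evqueue {
	int	items[QCAP];
	size_t	head, len;
};

/* State kept between calls so that one scan can be done piecewise. */
struct scan_state {
	bool	queued;	/* the end marker is currently in the queue */
	bool	done;	/* the scan reached its end marker */
};

static void
evq_push(struct evqueue *q, int id)
{
	assert(q->len < QCAP);
	q->items[(q->head + q->len) % QCAP] = id;
	q->len++;
}

static int
evq_pop(struct evqueue *q)
{
	int id;

	assert(q->len > 0);
	id = q->items[q->head];
	q->head = (q->head + 1) % QCAP;
	q->len--;
	return id;
}

/*
 * Collect up to max events. The marker is queued once, on the first call,
 * and stays in place across calls. Events pushed after that point sit behind
 * the marker and are left for the next scan.
 */
static size_t
scan_step(struct evqueue *q, struct scan_state *st, int *out, size_t max)
{
	size_t n = 0;

	if (st->done)
		return 0;
	if (!st->queued) {
		evq_push(q, MARKER);
		st->queued = true;
	}
	while (n < max) {
		int id = evq_pop(q);	/* marker guarantees non-empty */

		if (id == MARKER) {
			st->queued = false;
			st->done = true;
			break;
		}
		out[n++] = id;
	}
	return n;
}
```

If the marker were re-queued on every call instead, an event reactivated between two calls would land in front of the new marker and could be collected a second time within the same scan.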

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.
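
For readers unfamiliar with the idiom, a minimal sketch follows. struct note and struct filtops are simplified, hypothetical stand-ins for the kernel structures, and demo_event() and filt_poll() exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for a knote. */
struct note {
	int	ready;
};

/* Simplified, hypothetical stand-in for struct filterops. */
struct filtops {
	int	(*f_attach)(struct note *);
	void	(*f_detach)(struct note *);
	int	(*f_event)(struct note *, long);
};

static int
demo_event(struct note *n, long hint)
{
	(void)hint;
	return n->ready;
}

/*
 * Designated initializers name each member, so the initializer does not
 * depend on the order of the fields and omitted members would be zeroed.
 * const allows the toolchain to place the object in read-only data.
 */
static const struct filtops demo_filtops = {
	.f_attach	= NULL,
	.f_detach	= NULL,
	.f_event	= demo_event,
};

/* Call the event callback if the filter provides one. */
static int
filt_poll(const struct filtops *ops, struct note *n)
{
	assert(ops != NULL);
	return ops->f_event != NULL ? ops->f_event(n, 0) : 0;
}
```

With positional initializers, adding or reordering a member of the struct would silently shift every existing initializer; naming the members avoids that.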

OK mpi@, deraadt@, anton@, bluhm@


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
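
The multiple-evaluation problem can be shown with a reduced model of such a macro. This is a sketch under assumptions: struct ev, EV_SET_NAIVE and EV_SET_ONCE are hypothetical and much smaller than the real EV_SET(), and the local name __kevp is used here only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct kevent. */
struct ev {
	uintptr_t	ident;
	short		filter;
};

/* Naive form: the kevp argument is expanded, and so evaluated, twice. */
#define EV_SET_NAIVE(kevp, a, b) do {		\
	(kevp)->ident = (a);			\
	(kevp)->filter = (b);			\
} while (0)

/* Safer form: kevp is evaluated once into a local copy. */
#define EV_SET_ONCE(kevp, a, b) do {		\
	struct ev *__kevp = (kevp);		\
	__kevp->ident = (a);			\
	__kevp->filter = (b);			\
} while (0)

/* Returns how far the cursor moved after one EV_SET_ONCE(cur++, ...). */
static int
advance_once(void)
{
	struct ev evs[4] = { { 0, 0 } };
	struct ev *cur = evs;

	EV_SET_ONCE(cur++, 7, -1);
	assert(evs[0].ident == 7 && evs[0].filter == -1);
	return (int)(cur - evs);
}

/* Returns how far the cursor moved after one EV_SET_NAIVE(cur++, ...). */
static int
advance_naive(void)
{
	struct ev evs[4] = { { 0, 0 } };
	struct ev *cur = evs;

	EV_SET_NAIVE(cur++, 7, -1);
	return (int)(cur - evs);
}
```

In this model the naive macro advances the cursor by two slots and splits the fields across two entries, while the copying form advances it by one and fills a single entry.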


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.
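
A minimal sketch of why the wrapper matters, using hypothetical names (struct note_list, notify_list(), NOTIFY and notify_or_fallback() are not taken from the header):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list type, used only for this illustration. */
struct note_list {
	int	count;	/* number of attached notes */
	int	fired;	/* how many times notify_list() ran */
};

static void
notify_list(struct note_list *nl)
{
	assert(nl != NULL);
	nl->fired++;
}

/*
 * The if statement is wrapped in do { } while (0), so the macro expands to
 * exactly one statement and cannot capture an else that follows it.
 */
#define NOTIFY(nl) do {				\
	if ((nl)->count > 0)			\
		notify_list((nl));		\
} while (0)

/* Returns 1 if the else branch ran, 0 otherwise. */
static int
notify_or_fallback(int enabled, struct note_list *nl)
{
	int fallback = 0;

	if (enabled)
		NOTIFY(nl);
	else
		fallback = 1;	/* pairs with if (enabled), as intended */
	return fallback;
}
```

Without the wrapper, the else above would bind to the if inside the macro, so an enabled call on an empty list would take the fallback path.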

"yeah" deraadt@


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.66 13-Feb-2022 visa

Add helper functions for f_modify and f_process to condense code

These new functions, knote_modify() and knote_process(), implement
the logic that is common to most f_modify and f_process instances.

The code is inlined so as to not add yet another call frame on the
already towering stack of kqueue functions. Also, the _fn versions
allow direct calling of an event function when there is only one
filter type to handle.


# 1.65 13-Feb-2022 visa

Rename knote_modify() to knote_assign()

This avoids verb overlap with f_modify.


# 1.64 11-Feb-2022 visa

Check klist emptiness instead of NULL pointer in KNOTE()

All callers of KNOTE() supply a non-NULL klist argument. Replace the
NULL pointer check with klist emptiness check as a small optimization.

OK mpi@


# 1.63 11-Feb-2022 visa

Inline klist_empty() for more economic machine code.

OK mpi@


# 1.62 08-Feb-2022 visa

poll(2): Switch to kqueue backend

Implement the poll(2) system call on top of the kqueue subsystem.
This obsoletes the old, non-MP-safe poll backend.

On entering poll(2), the new code translates each pollfd array entry
into a set of knotes. When these knotes receive events through kqueue,
the events are translated back to pollfd format.

Entries in the pollfd array can refer to the same file descriptor with
overlapping event masks. To allow such overlap with knotes, use an extra
kn_pollid key that separates knotes of different pollfd entries.

Adapted from DragonFly BSD, initial implementation by mpi@.

Tested in snaps for three weeks.

OK mpi@


# 1.61 11-Dec-2021 visa

Clarify usage of __EV_POLL and __EV_SELECT

Make __EV_POLL specific to kqueue-based poll(2), to remove overlap
with __EV_SELECT that only select(2) uses.

OK millert@ mpi@


# 1.60 08-Dec-2021 visa

Fix select(2) exceptfds handling of FIFOs and pipes

Prevent select(2) from indicating an exceptional condition when the
other end of a FIFO or pipe is closed.

Originally, select(2) returned an exceptfds event only with a pty or
socket that has out-of-band data pending. millert@ says that OpenBSD
diverged from this by accident when poll(2) and select(2) were changed
to use the same backend code in year 2003.

OK millert@


# 1.59 29-Nov-2021 visa

kqueue: Revise badfd knote handling

When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.

The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.

The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.

As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.

The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.

Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.

OK mpi@


# 1.58 12-Nov-2021 visa

Keep knotes between poll/select systems calls

Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.

To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.

This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.

Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.

Tested by anton@ and mpi@
OK mpi@


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9).

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@
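
A minimal sketch of the shadowing problem, assuming a hypothetical ITEM_SET()
macro rather than the real EV_SET(). If the macro's local were named plain
"ip", a caller whose own variable is also called "ip" would expand to
"struct item *ip = (ip);", which initializes the new variable from itself.
The double-underscore name avoids the clash; such names are reserved for the
implementation, which a system header is.

```c
#include <assert.h>
#include <stddef.h>

struct item {
	int	id;
	int	flags;
};

/* Hypothetical macro; the local copy uses a reserved-style name. */
#define ITEM_SET(itemp, i, f) do {		\
	struct item *__ip = (itemp);		\
	__ip->id = (i);				\
	__ip->flags = (f);			\
} while (0)

void
item_init(struct item *ip, int id)
{
	assert(ip != NULL);
	/* The caller's "ip" appears in two arguments without being shadowed. */
	ITEM_SET(ip, id, ip->flags | 1);
}
```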


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@
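
The idiom looks roughly like this. The struct and its members are
hypothetical stand-ins, not the real struct filterops; the point is that
members are named at the initialization site, omitted members are zeroed,
and the const object can be placed in read-only data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, loosely modelled on a filter ops struct. */
struct demo_ops {
	int	flags;
	int	(*attach)(int);
	void	(*detach)(int);
	int	(*event)(int, long);
};

int
demo_event(int id, long hint)
{
	(void)id;
	return hint != 0;
}

/* Members not named here (attach, detach) are initialized to NULL. */
const struct demo_ops demo_ops = {
	.flags	= 1,
	.event	= demo_event,
};

int
demo_poll(long hint)
{
	assert(demo_ops.event != NULL);
	return demo_ops.event(0, hint);
}
```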


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
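
A sketch of the multiple-evaluation hazard, using hypothetical PAIR_SET
macros instead of the real EV_SET(). When the first argument has a side
effect such as "cur++", the unsafe variant applies it once per use, while
the variant that copies the argument into a local evaluates it once.

```c
#include <assert.h>
#include <stddef.h>

struct pair {
	int	a;
	int	b;
};

/* Evaluates p twice. */
#define PAIR_SET_UNSAFE(p, x, y) do {	\
	(p)->a = (x);			\
	(p)->b = (y);			\
} while (0)

/* Evaluates p once by copying it first. */
#define PAIR_SET(p, x, y) do {		\
	struct pair *__pp = (p);	\
	__pp->a = (x);			\
	__pp->b = (y);			\
} while (0)

/* arr must hold at least two elements; returns how far cur advanced. */
int
fill_unsafe(struct pair *arr)
{
	struct pair *cur = arr;

	assert(arr != NULL);
	PAIR_SET_UNSAFE(cur++, 1, 2);	/* writes arr[0].a and arr[1].b */
	return (int)(cur - arr);
}

int
fill_safe(struct pair *arr)
{
	struct pair *cur = arr;

	assert(arr != NULL);
	PAIR_SET(cur++, 1, 2);		/* writes arr[0].a and arr[0].b */
	return (int)(cur - arr);
}
```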


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@
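
The kind of "strange expansion" that parentheses prevent can be shown with a
hypothetical macro (not the real KNOTE()). An argument such as "a + b" is
pasted into the macro body as text, so operator precedence changes the
result unless each use of the argument is wrapped in ().

```c
#include <assert.h>
#include <stddef.h>

struct counter {
	int	hits;
};

/* n is used bare: ADD_HITS_UNSAFE(c, a + b) adds a + b * 2. */
#define ADD_HITS_UNSAFE(c, n) do {	\
	(c)->hits += n * 2;		\
} while (0)

/* n is parenthesized: ADD_HITS(c, a + b) adds (a + b) * 2. */
#define ADD_HITS(c, n) do {		\
	(c)->hits += (n) * 2;		\
} while (0)

int
hits_unsafe(struct counter *c, int a, int b)
{
	assert(c != NULL);
	ADD_HITS_UNSAFE(c, a + b);
	return c->hits;
}

int
hits_safe(struct counter *c, int a, int b)
{
	assert(c != NULL);
	ADD_HITS(c, a + b);
	return c->hits;
}
```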


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.63 11-Feb-2022 visa

Inline klist_empty() for more economic machine code.

OK mpi@


# 1.62 08-Feb-2022 visa

poll(2): Switch to kqueue backend

Implement the poll(2) system call on top of the kqueue subsystem.
This obsoletes the old, non-MP-safe poll backend.

On entering poll(2), the new code translates each pollfd array entry
into a set of knotes. When these knotes receive events through kqueue,
the events are translated back to pollfd format.

Entries in the pollfd array can refer to the same file descriptor with
overlapping event masks. To allow such overlap with knotes, use an extra
kn_pollid key that separates knotes of different pollfd entries.

Adapted from DragonFly BSD, initial implementation by mpi@.

Tested in snaps for three weeks.

OK mpi@


# 1.61 11-Dec-2021 visa

Clarify usage of __EV_POLL and __EV_SELECT

Make __EV_POLL specific to kqueue-based poll(2), to remove overlap
with __EV_SELECT that only select(2) uses.

OK millert@ mpi@


# 1.60 08-Dec-2021 visa

Fix select(2) exceptfds handling of FIFOs and pipes

Prevent select(2) from indicating an exceptional condition when the
other end of a FIFO or pipe is closed.

Originally, select(2) returned an exceptfds event only with a pty or
socket that has out-of-band data pending. millert@ says that OpenBSD
diverged from this by accident when poll(2) and select(2) were changed
to use the same backend code in year 2003.

OK millert@


# 1.59 29-Nov-2021 visa

kqueue: Revise badfd knote handling

When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.

The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.

The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.

As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.

The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.

Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.

OK mpi@


# 1.58 12-Nov-2021 visa

Keep knotes between poll/select system calls

Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.

To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.

This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.

Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.

Tested by anton@ and mpi@
OK mpi@
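
The serial-number scheme can be sketched in userland as follows. This is
only an illustration under simplifying assumptions: struct pknote, struct
pollq and the poll_*() helpers are hypothetical, every retained knote is
treated as having a pending event, and serial wrap-around is not handled.

```c
#include <assert.h>
#include <stddef.h>

#define MAXKN	8

struct pknote {
	int		fd;
	unsigned long	serial;
	int		in_use;
};

struct pollq {
	struct pknote	kn[MAXKN];
	unsigned long	next_serial;
};

/* Start an iteration; serials below the returned value are expired. */
unsigned long
poll_begin(struct pollq *pq)
{
	return pq->next_serial;
}

/* Register fd, or refresh the serial of a knote kept from earlier. */
void
poll_register(struct pollq *pq, int fd)
{
	struct pknote *slot = NULL;
	size_t i;

	for (i = 0; i < MAXKN; i++) {
		if (pq->kn[i].in_use && pq->kn[i].fd == fd) {
			pq->kn[i].serial = pq->next_serial++;
			return;
		}
		if (!pq->kn[i].in_use && slot == NULL)
			slot = &pq->kn[i];
	}
	assert(slot != NULL);	/* sketch: no growth when the table is full */
	slot->fd = fd;
	slot->serial = pq->next_serial++;
	slot->in_use = 1;
}

/* Deliver knotes of the current iteration; drop expired ones. */
size_t
poll_scan(struct pollq *pq, unsigned long base, int *out, size_t max)
{
	size_t i, n = 0;

	for (i = 0; i < MAXKN; i++) {
		if (!pq->kn[i].in_use)
			continue;
		if (pq->kn[i].serial < base)
			pq->kn[i].in_use = 0;	/* not in the monitored set */
		else if (n < max)
			out[n++] = pq->kn[i].fd;
	}
	return n;
}
```

A descriptor that was polled in one iteration but omitted from the next
keeps its old serial, so the next scan drops its knote instead of reporting
an event for it.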


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@
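
The idea of attaching lock operations to a list through a callback
descriptor and an argument can be sketched like this. All names below
(struct listlock_ops, struct notelist, the fake lock) are hypothetical and
do not reproduce the kernel's struct klistops; a real implementation would
call into a mutex or rwlock instead of a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Callback descriptor: how to lock and unlock the owner's lock. */
struct listlock_ops {
	void	(*lock)(void *);
	void	(*unlock)(void *);
};

/* The list stores the descriptor plus the argument passed to it. */
struct notelist {
	const struct listlock_ops	*ops;
	void				*arg;
	int				 nnotes;
};

/* Stand-in for the monitored object's own lock. */
struct fake_lock {
	int	depth;
	int	acquisitions;
};

void
fake_lock_enter(void *arg)
{
	struct fake_lock *fl = arg;

	fl->depth++;
	fl->acquisitions++;
}

void
fake_lock_leave(void *arg)
{
	struct fake_lock *fl = arg;

	fl->depth--;
}

const struct listlock_ops fake_ops = {
	.lock	= fake_lock_enter,
	.unlock	= fake_lock_leave,
};

void
notelist_init(struct notelist *nl, const struct listlock_ops *ops, void *arg)
{
	assert(ops != NULL);
	nl->ops = ops;
	nl->arg = arg;
	nl->nnotes = 0;
}

/* The list code locks through the descriptor, unaware of the lock type. */
void
notelist_insert(struct notelist *nl)
{
	nl->ops->lock(nl->arg);
	nl->nnotes++;
	nl->ops->unlock(nl->arg);
}
```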


# 1.62 08-Feb-2022 visa

poll(2): Switch to kqueue backend

Implement the poll(2) system call on top of the kqueue subsystem.
This obsoletes the old, non-MP-safe poll backend.

On entering poll(2), the new code translates each pollfd array entry
into a set of knotes. When these knotes receive events through kqueue,
the events are translated back to pollfd format.

Entries in the pollfd array can refer to the same file descriptor with
overlapping event masks. To allow such overlap with knotes, use an extra
kn_pollid key that separates knotes of different pollfd entries.

Adapted from DragonFly BSD, initial implementation by mpi@.

Tested in snaps for three weeks.

OK mpi@


# 1.61 11-Dec-2021 visa

Clarify usage of __EV_POLL and __EV_SELECT

Make __EV_POLL specific to kqueue-based poll(2), to remove overlap
with __EV_SELECT that only select(2) uses.

OK millert@ mpi@


# 1.60 08-Dec-2021 visa

Fix select(2) exceptfds handling of FIFOs and pipes

Prevent select(2) from indicating an exceptional condition when the
other end of a FIFO or pipe is closed.

Originally, select(2) returned an exceptfds event only with a pty or
socket that has out-of-band data pending. millert@ says that OpenBSD
diverged from this by accident when poll(2) and select(2) were changed
to use the same backend code in year 2003.

OK millert@


# 1.59 29-Nov-2021 visa

kqueue: Revise badfd knote handling

When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.

The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.

The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.

As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.

The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.

Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.

OK mpi@


# 1.58 12-Nov-2021 visa

Keep knotes between poll/select system calls

Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.

To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.

This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.

Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.

Tested by anton@ and mpi@
OK mpi@
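
The serial number scheme described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the kernel code: struct note, struct poller and the poller_*() functions are hypothetical names, the table is a fixed array, and serial wrap-around is not handled.

```c
#include <stddef.h>
#include <stdint.h>

#define MAXNOTES 8

/* Hypothetical stand-in for a retained poll/select knote. */
struct note {
    int      fd;
    uint64_t serial;
    int      used;
};

struct poller {
    struct note notes[MAXNOTES];
    uint64_t    next_serial;    /* next unused serial */
    uint64_t    scan_base;      /* first serial of the current iteration */
};

/* Start a new poll/select iteration with a fresh serial range. */
static void
poller_begin(struct poller *p)
{
    p->scan_base = p->next_serial;
}

/* Register fd for this iteration: refresh an existing note or add one. */
static int
poller_register(struct poller *p, int fd)
{
    struct note *free_slot = NULL;
    int i;

    for (i = 0; i < MAXNOTES; i++) {
        struct note *n = &p->notes[i];

        if (n->used && n->fd == fd) {
            n->serial = p->next_serial++;   /* keep the note, renew serial */
            return 0;
        }
        if (!n->used && free_slot == NULL)
            free_slot = n;
    }
    if (free_slot == NULL)
        return -1;
    free_slot->fd = fd;
    free_slot->serial = p->next_serial++;
    free_slot->used = 1;
    return 0;
}

/* An event on fd fired: deliver it (1) or drop the expired note (0). */
static int
poller_deliver(struct poller *p, int fd)
{
    int i;

    for (i = 0; i < MAXNOTES; i++) {
        struct note *n = &p->notes[i];

        if (!n->used || n->fd != fd)
            continue;
        if (n->serial >= p->scan_base)
            return 1;
        n->used = 0;    /* stale: fd is no longer in the monitored set */
        return 0;
    }
    return 0;
}
```

A note that was registered in an earlier iteration but not in the current one keeps its old serial, so the delivery step recognizes it as expired and drops it instead of reporting an event.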


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@
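
A rough sketch of the end marker idea, assuming a <sys/queue.h> tail queue. The names struct ev, struct scan_state and scan_*() are hypothetical; the real kqueue_scan() also deals with start markers, locking and sleeping, all of which are left out here.

```c
#include <stddef.h>
#include <sys/queue.h>

struct ev {
    TAILQ_ENTRY(ev) link;
    int             id;         /* event identifier */
    int             refire;     /* requeue once after collection */
};
TAILQ_HEAD(evq, ev);

/* Scan context: the end marker stays queued between calls. */
struct scan_state {
    struct evq *q;
    struct ev   end;
};

static void
scan_init(struct scan_state *s, struct evq *q)
{
    s->q = q;
    s->end.id = -1;
    s->end.refire = 0;
    TAILQ_INSERT_TAIL(q, &s->end, link);
}

/* Collect up to max events that sit in front of the end marker. */
static int
scan_some(struct scan_state *s, int *out, int max)
{
    struct ev *e;
    int n = 0;

    while (n < max && (e = TAILQ_FIRST(s->q)) != &s->end) {
        TAILQ_REMOVE(s->q, e, link);
        out[n++] = e->id;
        if (e->refire) {
            /* A reactivated event lands behind the marker. */
            e->refire = 0;
            TAILQ_INSERT_TAIL(s->q, e, link);
        }
    }
    return n;
}

/* End of the scan: take the marker out of the queue again. */
static void
scan_finish(struct scan_state *s)
{
    TAILQ_REMOVE(s->q, &s->end, link);
}
```

Because the marker survives between scan_some() calls, an event that is requeued during the scan is not collected a second time by the same scan; it stays in the queue for the next one.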


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
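
A small sketch of the single-evaluation pattern. struct mini_kevent and MINI_EV_SET are hypothetical, trimmed-down stand-ins and not the definitions from event.h. The double-underscore local mirrors the convention that later log entries describe for avoiding shadowing; that prefix is reserved for the implementation, so application macros would normally pick a different name.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct kevent. */
struct mini_kevent {
    uintptr_t ident;
    short     filter;
};

/*
 * Copy the pointer argument once into a local, so that an argument
 * with side effects is evaluated a single time.  Naming the local
 * __kevp keeps it from shadowing a caller variable called kevp.
 */
#define MINI_EV_SET(kevp, a, b) do {            \
    struct mini_kevent *__kevp = (kevp);        \
    __kevp->ident = (uintptr_t)(a);             \
    __kevp->filter = (short)(b);                \
} while (0)

/* Fill n entries through a post-incremented cursor; return how far it moved. */
static int
fill_events(struct mini_kevent *evs, int n)
{
    struct mini_kevent *kevp = evs;
    int i;

    for (i = 0; i < n; i++)
        MINI_EV_SET(kevp++, i + 10, -1);
    return (int)(kevp - evs);
}
```

If the macro expanded its first argument once per assigned field, kevp++ would advance the cursor several times per call and scatter the fields across neighbouring entries.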


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


# 1.61 11-Dec-2021 visa

Clarify usage of __EV_POLL and __EV_SELECT

Make __EV_POLL specific to kqueue-based poll(2), to remove overlap
with __EV_SELECT that only select(2) uses.

OK millert@ mpi@


# 1.60 08-Dec-2021 visa

Fix select(2) exceptfds handling of FIFOs and pipes

Prevent select(2) from indicating an exceptional condition when the
other end of a FIFO or pipe is closed.

Originally, select(2) returned an exceptfds event only with a pty or
socket that has out-of-band data pending. millert@ says that OpenBSD
diverged from this by accident when poll(2) and select(2) were changed
to use the same backend code in year 2003.

OK millert@


# 1.59 29-Nov-2021 visa

kqueue: Revise badfd knote handling

When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.

The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.

The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.

As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.

The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.

Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.

OK mpi@


# 1.58 12-Nov-2021 visa

Keep knotes between poll/select systems calls

Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.

To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.

This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.

Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.

Tested by anton@ and mpi@
OK mpi@


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@
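
The role of the end marker can be sketched with a small user-space model. This is only an illustration under simplifying assumptions: the names qnode, evqueue, scan_state, scan_init() and scan_some() are hypothetical, the queue is a plain singly linked list, and locking and other threads' markers are left out.

```c
#include <stddef.h>

struct qnode {
	struct qnode	*next;
	int		 id;
	int		 still_active;	/* re-queue after collection */
};

struct evqueue {
	struct qnode	*head;
	struct qnode	*tail;
};

struct scan_state {
	struct evqueue	*q;
	struct qnode	 end;	/* end marker, stays queued between calls */
	int		 done;
};

static void
enqueue(struct evqueue *q, struct qnode *n)
{
	n->next = NULL;
	if (q->tail != NULL)
		q->tail->next = n;
	else
		q->head = n;
	q->tail = n;
}

static struct qnode *
dequeue(struct evqueue *q)
{
	struct qnode *n = q->head;

	if (n != NULL) {
		q->head = n->next;
		if (q->head == NULL)
			q->tail = NULL;
	}
	return n;
}

/* Start a scan: everything queued before the marker belongs to it. */
void
scan_init(struct scan_state *s, struct evqueue *q)
{
	s->q = q;
	s->done = 0;
	s->end.id = -1;
	s->end.still_active = 0;
	enqueue(q, &s->end);
}

/* Collect up to max events; may be called repeatedly for one scan. */
int
scan_some(struct scan_state *s, int *out, int max)
{
	int n = 0;

	while (!s->done && n < max) {
		struct qnode *kn = dequeue(s->q);

		if (kn == NULL || kn == &s->end) {
			s->done = 1;	/* reached the end marker */
			break;
		}
		out[n++] = kn->id;
		if (kn->still_active)
			enqueue(s->q, kn);	/* lands behind the marker */
	}
	return n;
}

/* Three always-active events, collected two at a time. */
int
demo_total_collected(void)
{
	struct qnode ev[3] = { { NULL, 1, 1 }, { NULL, 2, 1 }, { NULL, 3, 1 } };
	struct evqueue q = { NULL, NULL };
	struct scan_state s;
	int out[2], i, n, total = 0;

	for (i = 0; i < 3; i++)
		enqueue(&q, &ev[i]);
	scan_init(&s, &q);
	while ((n = scan_some(&s, out, 2)) > 0)
		total += n;
	return total;
}
```

In this model each event is collected exactly once even though every event is re-queued immediately. Without the persistent marker, the loop in demo_total_collected() would never terminate.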


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
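
A reduced sketch of the multiple-evaluation problem follows. The struct and macro names are hypothetical and much smaller than the real struct kevent and EV_SET(). The double-underscore local mirrors the convention that later log entries describe for macros in system headers; such names are reserved for the implementation, so ordinary application code should pick a different prefix.

```c
#include <stdint.h>

/* Hypothetical, reduced event record; not the real struct kevent. */
struct ev {
	uint64_t	ident;
	short		filter;
};

/* Naive form: the first argument is evaluated once per assignment. */
#define EV_FILL_NAIVE(evp, a, b) do {		\
	(evp)->ident = (a);			\
	(evp)->filter = (b);			\
} while (0)

/* Copying form: the first argument is evaluated exactly once. */
#define EV_FILL_ONCE(evp, a, b) do {		\
	struct ev *__evp = (evp);		\
	__evp->ident = (a);			\
	__evp->filter = (b);			\
} while (0)

/* How far does p move when the argument has a side effect? */
int
slots_advanced_naive(void)
{
	struct ev evs[2] = { { 0, 0 }, { 0, 0 } };
	struct ev *p = evs;

	EV_FILL_NAIVE(p++, 7, -1);	/* p++ runs twice */
	return (int)(p - evs);
}

int
slots_advanced_once(void)
{
	struct ev evs[2] = { { 0, 0 }, { 0, 0 } };
	struct ev *p = evs;

	EV_FILL_ONCE(p++, 7, -1);	/* p++ runs once */
	return (int)(p - evs);
}
```

With the naive macro a call such as EV_FILL_NAIVE(p++, ...) spreads one event over two array slots, which is the kind of surprise the copy avoids.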


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to log in: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.
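
A minimal sketch of the idiom, using a hypothetical NOTIFY() macro and notify() helper rather than the kernel's KNOTE():

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's KNOTE(). */
static int notify_count;

static void
notify(int hint)
{
	notify_count += hint;
}

/*
 * A bare "if ((list) != NULL) notify(hint)" body would let an else that
 * follows the macro call bind to the macro's hidden if. Wrapping the
 * body in do { } while (0) turns it into one statement.
 */
#define NOTIFY(list, hint) do {			\
	if ((list) != NULL)			\
		notify(hint);			\
} while (0)

/* Returns 1 when the caller's else branch runs, 0 otherwise. */
int
else_branch_taken(int cond, void *list)
{
	int taken = 0;

	if (cond)
		NOTIFY(list, 1);
	else
		taken = 1;	/* pairs with "if (cond)", as intended */
	return taken;
}
```

With the wrapper in place, the else belongs to the caller's own if regardless of whether list is NULL.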

"yeah" deraadt@


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.60 08-Dec-2021 visa

Fix select(2) exceptfds handling of FIFOs and pipes

Prevent select(2) from indicating an exceptional condition when the
other end of a FIFO or pipe is closed.

Originally, select(2) returned an exceptfds event only with a pty or
socket that has out-of-band data pending. millert@ says that OpenBSD
diverged from this by accident when poll(2) and select(2) were changed
to use the same backend code in year 2003.

OK millert@


# 1.59 29-Nov-2021 visa

kqueue: Revise badfd knote handling

When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.

The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.

The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.

As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.

The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.

Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.

OK mpi@


# 1.58 12-Nov-2021 visa

Keep knotes between poll/select system calls

Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.

To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.

This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.

Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.

Tested by anton@ and mpi@
OK mpi@
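
The serial number scheme can be illustrated with a toy model. Everything below is an assumption made for the sketch: the names kq_model, iter_begin(), kq_register() and kq_scan() are hypothetical, every retained knote is treated as ready, and serial wrap-around is not handled.

```c
#include <stddef.h>
#include <stdint.h>

#define MAXKN	8

struct pknote {
	int		fd;
	uint64_t	serial;
};

struct kq_model {
	struct pknote	kn[MAXKN];
	size_t		nkn;
	uint64_t	next_serial;
};

/* Each poll/select iteration starts a fresh, unused serial range. */
static uint64_t
iter_begin(const struct kq_model *kq)
{
	return kq->next_serial;
}

/* Reuse a retained knote if one exists; either way refresh its serial. */
static void
kq_register(struct kq_model *kq, int fd)
{
	size_t i;

	for (i = 0; i < kq->nkn; i++) {
		if (kq->kn[i].fd == fd) {
			kq->kn[i].serial = kq->next_serial++;
			return;
		}
	}
	if (kq->nkn < MAXKN) {
		kq->kn[kq->nkn].fd = fd;
		kq->kn[kq->nkn].serial = kq->next_serial++;
		kq->nkn++;
	}
}

/* Deliver knotes of the current iteration; drop expired ones. */
static size_t
kq_scan(struct kq_model *kq, uint64_t start, int *out, size_t nout)
{
	size_t i = 0, n = 0;

	while (i < kq->nkn) {
		if (kq->kn[i].serial < start) {
			/* Not refreshed this iteration: expired, drop it. */
			kq->kn[i] = kq->kn[--kq->nkn];
			continue;
		}
		if (n < nout)
			out[n++] = kq->kn[i].fd;
		i++;
	}
	return n;
}

/* Poll {3, 4}, then poll {4}: the second scan reports only fd 4. */
int
demo_second_iteration(void)
{
	struct kq_model kq = { .nkn = 0, .next_serial = 1 };
	int fds[MAXKN];
	uint64_t start;

	start = iter_begin(&kq);
	kq_register(&kq, 3);
	kq_register(&kq, 4);
	(void)kq_scan(&kq, start, fds, MAXKN);

	start = iter_begin(&kq);
	kq_register(&kq, 4);	/* fd 3 is no longer monitored */
	return (int)kq_scan(&kq, start, fds, MAXKN);
}
```

The knote for fd 4 is reused across both iterations, while the stale knote for fd 3 is recognized by its old serial and discarded during the second scan.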


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@
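
The idea of associating lock operations with a list, with a fallback for uninitialized lists, can be sketched as follows. This is a user-space toy, not the kernel interface: toy_lock, klist_lockops, klist_model and the functions around them are hypothetical names, and the lock only counts acquisitions instead of blocking.

```c
#include <stddef.h>

/* Toy lock that only records acquisitions. */
struct toy_lock {
	int	held;
	int	acquired;
};

/* Callback descriptor: how to lock and unlock a list. */
struct klist_lockops {
	void	(*lock)(void *);
	void	(*unlock)(void *);
};

struct klist_model {
	const struct klist_lockops *ops;	/* NULL: never initialized */
	void	*arg;				/* passed to the callbacks */
	int	 nknotes;
};

static struct toy_lock big_lock;	/* stand-in for the kernel lock */

static void
toy_lock_enter(void *arg)
{
	struct toy_lock *l = arg;

	l->held = 1;
	l->acquired++;
}

static void
toy_lock_leave(void *arg)
{
	struct toy_lock *l = arg;

	l->held = 0;
}

static const struct klist_lockops toy_ops = {
	.lock	= toy_lock_enter,
	.unlock	= toy_lock_leave,
};

/* Tie the list to the lock that already protects the monitored object. */
void
klist_model_init(struct klist_model *kl, struct toy_lock *objlock)
{
	kl->ops = &toy_ops;
	kl->arg = objlock;
	kl->nknotes = 0;
}

/* Insert under the list's lock, or under the big lock as a fallback. */
void
klist_model_insert(struct klist_model *kl)
{
	const struct klist_lockops *ops = kl->ops != NULL ? kl->ops : &toy_ops;
	void *arg = kl->ops != NULL ? kl->arg : &big_lock;

	ops->lock(arg);
	kl->nknotes++;
	ops->unlock(arg);
}

int
demo_object_lock_used(void)
{
	struct toy_lock obj = { 0, 0 };
	struct klist_model kl;

	klist_model_init(&kl, &obj);
	klist_model_insert(&kl);
	return obj.acquired;
}

int
demo_fallback_used(void)
{
	struct klist_model kl = { NULL, NULL, 0 };
	int before = big_lock.acquired;

	klist_model_insert(&kl);
	return big_lock.acquired - before;
}
```

Keeping the lock outside the list lets the monitored object reuse the lock it already has, at the cost of two extra pointers per list.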


# 1.59 29-Nov-2021 visa

kqueue: Revise badfd knote handling

When closing a file descriptor and converting the poll/select knotes
into badfd knotes, keep the knotes attached to the by-fd table. This
should prevent kqueue_purge() from returning before the kqueue has
become quiescent. This in turn should fix a
KASSERT(TAILQ_EMPTY(&kq->kq_head)) panic in KQRELE() that bluhm@ has
reported.

The badfd conversion is only needed when a poll/select scan is ongoing.
The system can skip the conversion if the knote is not part of the
active event set.

The code of this commit skips the conversion when the fd is closed by
the same thread that has done the fd polling. This can be improved but
should already cover typical fd usage patterns.

As badfd knotes now hold slots in the by-fd table, kqueue_register()
clears them. poll/select use kqueue_register() to set up a new scan;
any found fd close notification is a leftover from the previous scan.

The new badfd handling should be free of accidental knote accumulation.
This obsoletes kqpoll_dequeue() and lowers kqpoll_init() overhead.

Re-enable lazy removal of poll/select knotes because the panic should
no longer happen.

OK mpi@


# 1.58 12-Nov-2021 visa

Keep knotes between poll/select system calls

Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.

To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.

This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.

Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.

Tested by anton@ and mpi@
OK mpi@
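
The serial number scheme described above can be sketched roughly as follows. This is a simplified userland model, not the kernel code: struct mini_knote, struct mini_poller and the poll_* helpers are hypothetical names, every retained knote is treated as ready, and serial wraparound is not handled.

```c
#include <stddef.h>

#define MINI_MAXNOTES	8

/* Hypothetical, simplified knote: only the fields this sketch needs. */
struct mini_knote {
	int		fd;
	unsigned long	serial;	/* serial assigned at (re)registration */
};

struct mini_poller {
	struct mini_knote	notes[MINI_MAXNOTES];
	size_t			nnotes;
	unsigned long		next_serial;	/* next unused serial */
};

/* Start an iteration: serials below the returned value are stale. */
static unsigned long
poll_begin(struct mini_poller *p)
{
	return p->next_serial;
}

/* Register fd for this iteration, reusing a retained knote if present. */
static int
poll_register(struct mini_poller *p, int fd)
{
	size_t i;

	for (i = 0; i < p->nnotes; i++) {
		if (p->notes[i].fd == fd) {
			p->notes[i].serial = p->next_serial++;
			return 0;
		}
	}
	if (p->nnotes == MINI_MAXNOTES)
		return -1;
	p->notes[p->nnotes].fd = fd;
	p->notes[p->nnotes].serial = p->next_serial++;
	p->nnotes++;
	return 0;
}

/* Count knotes from the current iteration and drop the expired ones. */
static size_t
poll_scan(struct mini_poller *p, unsigned long first_serial)
{
	size_t i = 0, fresh = 0;

	while (i < p->nnotes) {
		if (p->notes[i].serial >= first_serial) {
			fresh++;
			i++;
		} else {
			/* Expired: overwrite with the last entry. */
			p->notes[i] = p->notes[--p->nnotes];
		}
	}
	return fresh;
}
```

In this model a descriptor that is polled again only has its serial refreshed, while a descriptor that dropped out of the monitored set is discarded lazily by the next scan.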


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@
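
The idea of associating lock operations with a list can be sketched as below. All names here (struct toy_lock, struct mini_klistops, the mini_klist_* functions) are hypothetical and much simpler than the kernel interface; the toy lock only records its state so that the dispatch is easy to follow.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock that only records state; a stand-in for a mutex or rwlock. */
struct toy_lock {
	int	held;
	int	acquisitions;
};

static void
toy_enter(void *arg)
{
	struct toy_lock *lk = arg;

	assert(!lk->held);
	lk->held = 1;
	lk->acquisitions++;
}

static void
toy_leave(void *arg)
{
	struct toy_lock *lk = arg;

	assert(lk->held);
	lk->held = 0;
}

static void
toy_assert_held(void *arg)
{
	struct toy_lock *lk = arg;

	assert(lk->held);
}

/* Hypothetical callback descriptor: how to operate the list's lock. */
struct mini_klistops {
	void	(*op_assert_held)(void *);
	void	(*op_lock)(void *);
	void	(*op_unlock)(void *);
};

static const struct mini_klistops toy_lock_ops = {
	.op_assert_held	= toy_assert_held,
	.op_lock	= toy_enter,
	.op_unlock	= toy_leave,
};

/* Hypothetical list head: descriptor plus the argument it operates on. */
struct mini_klist {
	const struct mini_klistops	*ops;
	void				*arg;
	int				 count;	/* stand-in for the knotes */
};

static void
mini_klist_init(struct mini_klist *kl, const struct mini_klistops *ops,
    void *arg)
{
	kl->ops = ops;
	kl->arg = arg;
	kl->count = 0;
}

/* Caller already holds the lock; the list only asserts that. */
static void
mini_klist_insert_locked(struct mini_klist *kl)
{
	kl->ops->op_assert_held(kl->arg);
	kl->count++;
}

/* The list takes and releases the lock itself. */
static void
mini_klist_insert(struct mini_klist *kl)
{
	kl->ops->op_lock(kl->arg);
	mini_klist_insert_locked(kl);
	kl->ops->op_unlock(kl->arg);
}
```

Because the list only sees function pointers and an opaque argument, the monitored object can reuse whatever lock already protects its state instead of carrying a second, embedded one.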


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@
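
The role of the persistent end marker can be modelled in a few lines. This is a userland sketch under simplifying assumptions: struct mini_ev, struct mini_queue and scan() are hypothetical, and "reactivation" is modelled by an event re-queueing itself as soon as it is collected.

```c
#include <stddef.h>

/* Hypothetical queue entry: a real event or the scan's end marker. */
struct mini_ev {
	struct mini_ev	*next;
	int		 id;
	int		 rearm;	/* re-queue itself when collected */
};

struct mini_queue {
	struct mini_ev	*head;
	struct mini_ev	*tail;
};

static void
enqueue(struct mini_queue *q, struct mini_ev *ev)
{
	ev->next = NULL;
	if (q->tail != NULL)
		q->tail->next = ev;
	else
		q->head = ev;
	q->tail = ev;
}

static struct mini_ev *
dequeue(struct mini_queue *q)
{
	struct mini_ev *ev = q->head;

	if (ev != NULL) {
		q->head = ev->next;
		if (q->head == NULL)
			q->tail = NULL;
	}
	return ev;
}

/*
 * Collect at most maxevents events that precede the end marker.
 * The marker stays queued between calls, so a re-activated event
 * lands behind it and is not collected a second time.
 * Returns the number collected; *done is set once the marker is hit.
 */
static int
scan(struct mini_queue *q, struct mini_ev *end, int maxevents, int *done)
{
	int n = 0;

	*done = 0;
	while (n < maxevents) {
		struct mini_ev *ev = dequeue(q);

		if (ev == NULL || ev == end) {
			*done = 1;
			break;
		}
		n++;
		if (ev->rearm)
			enqueue(q, ev);	/* goes to the tail, past the marker */
	}
	return n;
}
```

Calling scan() twice with a small maxevents completes one logical scan piecewise: the second call resumes where the first stopped and ends at the marker placed before the first call.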


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@
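
A small sketch of the pattern, assuming a cut-down table type: struct mini_filterops, struct mini_knote, MINI_FILTEROP_ISFD and the mini_read_* functions are hypothetical and only loosely modelled on the kernel's struct filterops.

```c
#include <stddef.h>

/* Hypothetical, simplified knote. */
struct mini_knote {
	long	kn_data;	/* bytes ready to read in this sketch */
};

/* Hypothetical, cut-down filter operations table. */
struct mini_filterops {
	int	f_flags;
	int	(*f_attach)(struct mini_knote *);
	void	(*f_detach)(struct mini_knote *);
	int	(*f_event)(struct mini_knote *, long);
};

#define MINI_FILTEROP_ISFD	0x01	/* hypothetical flag value */

static void
mini_read_detach(struct mini_knote *kn)
{
	kn->kn_data = 0;
}

static int
mini_read_event(struct mini_knote *kn, long hint)
{
	(void)hint;
	return kn->kn_data > 0;
}

/*
 * Designated initializers name each member, so the table stays correct
 * if members are reordered or added; omitted members would be zero.
 * const allows the toolchain to place the table in read-only data.
 */
static const struct mini_filterops mini_read_filtops = {
	.f_flags	= MINI_FILTEROP_ISFD,
	.f_attach	= NULL,
	.f_detach	= mini_read_detach,
	.f_event	= mini_read_event,
};
```

With positional initializers, adding or replacing a member would silently shift every later value; the named form makes such changes a compile-time concern only for the members actually affected.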


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking the TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
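
The multiple-evaluation hazard can be shown with a reduced example. struct mini_event and the MINI_SET macros are hypothetical; they mimic the shape of an EV_SET-style macro without being the real definition.

```c
#include <stdint.h>

/* Hypothetical, simplified event record; not the real struct kevent. */
struct mini_event {
	uintptr_t	ident;
	short		filter;
	unsigned short	flags;
};

/* Naive form: the first argument is expanded once per assignment. */
#define MINI_SET_NAIVE(evp, a, b, c) do {			\
	(evp)->ident = (a);					\
	(evp)->filter = (b);					\
	(evp)->flags = (c);					\
} while (0)

/* Safer form: evaluate the first argument once into a local copy. */
#define MINI_SET(evp, a, b, c) do {				\
	struct mini_event *mini_evp_copy = (evp);		\
	mini_evp_copy->ident = (a);				\
	mini_evp_copy->filter = (b);				\
	mini_evp_copy->flags = (c);				\
} while (0)

/* Returns how far the cursor moved after one MINI_SET(cur++, ...). */
static int
advance_safe(void)
{
	struct mini_event evs[4] = {{ 0 }};
	struct mini_event *cur = evs;

	MINI_SET(cur++, 7, -1, 1);
	return (int)(cur - evs);
}

/* Same call with the naive macro: cur++ runs once per assignment. */
static int
advance_naive(void)
{
	struct mini_event evs[4] = {{ 0 }};
	struct mini_event *cur = evs;

	MINI_SET_NAIVE(cur++, 7, -1, 1);
	return (int)(cur - evs);
}
```

With the naive macro the three fields end up in three different array slots and the cursor skips ahead, which is the kind of surprise a caller filling an event array in a loop would hit.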


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to log in: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


# 1.58 12-Nov-2021 visa

Keep knotes between poll/select systems calls

Reduce the time overhead of kqueue-based poll(2) and select(2) by
keeping knotes registered between the system calls. It is expected that
the set of monitored file descriptors is relatively unchanged between
consecutive iterations of these system calls. By keeping the knotes,
the system saves the effort of repeated knote unregistering and
re-registering.

To avoid receiving events from file descriptors that are no longer in
the monitored set, each poll/select knote is assigned an increasing
serial number. Every iteration of poll/select uses a previously unused
range of serials for its knotes. In the setup stage, kqueue_register()
updates the serials of any existing knotes in the currently monitored
set. Function kqueue_scan() delivers only the events whose serials are
recent enough; expired knotes are dropped. When the serial range is
about to wrap around, all the knotes in the kqueue backend are dropped.

This change is a space-time tradeoff. Memory usage is increased somewhat
because of the retained knotes. The increase is limited by the number
of open file descriptors and active threads.

Idea from DragonFly BSD, initial patch by mpi@, kqueue_scan()-based
approach by me.

Tested by anton@ and mpi@
OK mpi@


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and Dragonfly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionnality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@
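
As an illustration of the pattern, the following sketch uses a cut-down,
hypothetical struct filterops_sketch rather than the real struct filterops;
the field and function names are assumptions. Designated initializers name
each member, omitted members are zero-initialized, and the const qualifier
lets the toolchain place the table in read-only data.

```c
#include <stddef.h>

/* Hypothetical, reduced callback table; not the kernel's struct filterops. */
struct filterops_sketch {
	int	 f_isfd;
	int	(*f_attach)(void *);
	void	(*f_detach)(void *);
	int	(*f_event)(void *, long);
};

static void
demo_detach(void *kn)
{
	(void)kn;
}

/* Report the event as active whenever the hint is non-zero. */
static int
demo_event(void *kn, long hint)
{
	(void)kn;
	return hint != 0;
}

/* Members are named explicitly; f_attach is left out and so is NULL. */
static const struct filterops_sketch demo_filtops = {
	.f_isfd		= 1,
	.f_detach	= demo_detach,
	.f_event	= demo_event,
};
```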


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
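
The multiple-evaluation problem can be shown with a small sketch. SET_NAIVE,
SET_COPY, next_slot() and the other names here are hypothetical, not the
definitions from sys/event.h. The naive macro mentions its first argument
twice, so an argument with a side effect runs twice; the copying version
evaluates it once.

```c
struct kev_sketch {
	unsigned long	ident;
	short		filter;
};

static struct kev_sketch slots[4];
static int calls;

/* Argument expression with a side effect: hands out the next slot. */
static struct kev_sketch *
next_slot(void)
{
	return &slots[calls++];
}

/* Evaluates kevp once per member assignment, so twice in total. */
#define SET_NAIVE(kevp, a, b) do {			\
	(kevp)->ident = (a);				\
	(kevp)->filter = (b);				\
} while (0)

/* Evaluates kevp exactly once by copying it into a local. */
#define SET_COPY(kevp, a, b) do {			\
	struct kev_sketch *__kevp = (kevp);		\
	__kevp->ident = (a);				\
	__kevp->filter = (b);				\
} while (0)

/* Number of next_slot() calls made by one use of each macro. */
static int
calls_with_naive(void)
{
	calls = 0;
	SET_NAIVE(next_slot(), 1, 2);
	return calls;
}

static int
calls_with_copy(void)
{
	calls = 0;
	SET_COPY(next_slot(), 1, 2);
	return calls;
}
```

In the naive case the two members end up in different slots, which is the
kind of surprise the copy prevents.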


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64-bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to log in: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@
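
A short sketch of why macro arguments are wrapped in parentheses. DOUBLE_BAD
and DOUBLE_OK are hypothetical macros, unrelated to the actual KNOTE
definition; they only show how operator precedence changes the expansion
when the argument is an expression.

```c
/* Expands DOUBLE_BAD(1 + 2) to (1 + 2 * 2). */
#define DOUBLE_BAD(x)	(x * 2)

/* Expands DOUBLE_OK(1 + 2) to ((1 + 2) * 2). */
#define DOUBLE_OK(x)	((x) * 2)

static int
double_bad(void)
{
	return DOUBLE_BAD(1 + 2);
}

static int
double_ok(void)
{
	return DOUBLE_OK(1 + 2);
}
```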


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@
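
The hazard can be sketched as follows, using hypothetical macros
COUNT_IF_BAD and COUNT_IF_OK rather than the real KNOTE. With a bare if in
the macro, an else written by the caller attaches to the macro's if; the
do { } while (0) wrapper turns the expansion into a single statement so the
else pairs with the caller's if. Compilers may warn about the dangling else
in the first function, which is the point of the example.

```c
/* Bare if: a following else binds to this if, not the caller's. */
#define COUNT_IF_BAD(cond, n)	if (cond) (n)++

/* Wrapped: behaves as one statement in any if/else context. */
#define COUNT_IF_OK(cond, n)	do { if (cond) (n)++; } while (0)

/* Returns 1 if the else branch ran. Intended: only when outer is 0. */
static int
else_ran_bad(int outer)
{
	int n = 0, else_ran = 0;

	if (outer)
		COUNT_IF_BAD(0, n);
	else
		else_ran = 1;
	(void)n;
	return else_ran;
}

static int
else_ran_ok(int outer)
{
	int n = 0, else_ran = 0;

	if (outer)
		COUNT_IF_OK(0, n);
	else
		else_ran = 1;
	(void)n;
	return else_ran;
}
```

The unwrapped version runs the else branch exactly when the outer condition
is true, the opposite of what the indentation suggests.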


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.57 24-Oct-2021 visa

Set klist lock for sockets to make socket event filters MP-safe

The filterops instances already provide f_modify and f_process
callbacks with proper internal locking. Locking of socket klists
has been the missing detail for MP-safety.

OK mpi@


Revision tags: OPENBSD_7_0_BASE
# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@
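
The idea of attaching lock operations to a list can be sketched in userland.
Everything below is a hypothetical model: struct lockops_sketch, struct
list_sketch, struct toy_lock and the helper functions are illustrative
names, and the toy lock only records state instead of providing real mutual
exclusion. The list code calls through the callback table and never needs to
know which lock type the monitored object uses.

```c
/* Callback descriptor: how to take and release the object's lock. */
struct lockops_sketch {
	int	(*lo_lock)(void *);
	void	(*lo_unlock)(void *, int);
};

/* List head extended with a callback descriptor and its argument. */
struct list_sketch {
	const struct lockops_sketch	*l_ops;
	void				*l_arg;
	int				 l_count;	/* entries on the list */
};

/* Stand-in for a mutex or rwlock; it only tracks state. */
struct toy_lock {
	int	held;
	int	acquisitions;
};

static int
toy_lock_enter(void *arg)
{
	struct toy_lock *lk = arg;

	lk->held = 1;
	lk->acquisitions++;
	return 0;	/* opaque state handed back to unlock */
}

static void
toy_lock_leave(void *arg, int state)
{
	struct toy_lock *lk = arg;

	(void)state;
	lk->held = 0;
}

static const struct lockops_sketch toy_lockops = {
	.lo_lock	= toy_lock_enter,
	.lo_unlock	= toy_lock_leave,
};

static void
list_init(struct list_sketch *l, const struct lockops_sketch *ops, void *arg)
{
	l->l_ops = ops;
	l->l_arg = arg;
	l->l_count = 0;
}

/* Insert that locks internally through the associated operations. */
static int
list_insert(struct list_sketch *l)
{
	int state, count;

	state = l->l_ops->lo_lock(l->l_arg);
	count = ++l->l_count;
	l->l_ops->lo_unlock(l->l_arg, state);
	return count;
}
```

A different monitored object could pass another callback table and argument
to list_init() without any change to list_insert().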


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.56 16-Jun-2021 visa

kqueue: kq_lock is needed when updating kn_status

The kn_status field of struct knote is part of kqueue's internal state.
When kn_status is being updated, kq_lock has to be locked. This is true
even with MP-unsafe event filters.

OK mpi@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@
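
The role of the end marker can be sketched in userland with a small queue. This is an illustration only: the demo_* names are hypothetical, every collected event is reactivated at once to show the worst case, and the real kqueue_scan() is considerably more involved.

```c
#include <stddef.h>

struct demo_ev {
	struct demo_ev *next;
	int id;			/* -1 identifies the end marker */
};

struct demo_queue {
	struct demo_ev *head;
	struct demo_ev *tail;
};

struct demo_scan_state {
	struct demo_ev end;	/* end marker, kept across calls */
	int queued;		/* nonzero once the marker is in the queue */
};

static void
demo_enqueue(struct demo_queue *q, struct demo_ev *ev)
{
	ev->next = NULL;
	if (q->tail != NULL)
		q->tail->next = ev;
	else
		q->head = ev;
	q->tail = ev;
}

static struct demo_ev *
demo_dequeue(struct demo_queue *q)
{
	struct demo_ev *ev = q->head;

	if (ev != NULL) {
		q->head = ev->next;
		if (q->head == NULL)
			q->tail = NULL;
	}
	return ev;
}

/*
 * Collect at most max event ids into out[].  Every collected event is
 * reactivated immediately, so it lands behind the end marker.  With
 * max > 0, a return value of 0 means the marker has been reached.
 */
static int
demo_scan(struct demo_queue *q, struct demo_scan_state *st, int *out, int max)
{
	int n = 0;

	if (!st->queued) {
		/* First call: remember where this scan has to stop. */
		st->end.id = -1;
		demo_enqueue(q, &st->end);
		st->queued = 1;
	}
	while (n < max && q->head != &st->end) {
		struct demo_ev *ev = demo_dequeue(q);

		out[n++] = ev->id;
		demo_enqueue(q, ev);	/* reactivated: goes to the tail */
	}
	return n;
}
```

Because the marker stays in place between calls, a scan split into several calls returns each event once even though the events keep re-entering the queue.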


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@
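
The shadowing problem can be illustrated with a hypothetical macro, DEMO_SET, that is modelled loosely on the description above and is not the real EV_SET(). Double-underscore names are reserved for the implementation, which a system header is part of; application code should not define such names, and this sketch does so only to mirror the convention.

```c
struct demo_event {
	int ident;
	int filter;
};

/*
 * The macro copies its first argument into a local pointer.  Naming the
 * local "__demo_p" keeps it from colliding with a caller variable such
 * as "p" or "kevp".
 */
#define DEMO_SET(evp, id, flt) do {			\
	struct demo_event *__demo_p = (evp);		\
	__demo_p->ident = (id);				\
	__demo_p->filter = (flt);			\
} while (0)

static int
demo_fill(struct demo_event *p)
{
	/*
	 * Had the macro's local been called "p", the initializer "(evp)"
	 * would expand to the macro's own uninitialized local instead of
	 * this parameter.
	 */
	DEMO_SET(p, 7, -1);
	return p->ident + p->filter;
}
```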


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@
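
A minimal sketch of the initializer style, using a hypothetical, trimmed-down demo_filterops rather than the kernel's struct filterops:

```c
#include <stddef.h>

struct demo_knote;	/* opaque here; only pointers are passed around */

/* Hypothetical, trimmed-down filter operations table. */
struct demo_filterops {
	int	flags;
	int	(*attach)(struct demo_knote *);
	void	(*detach)(struct demo_knote *);
	int	(*event)(struct demo_knote *, long hint);
};

static void
demo_detach(struct demo_knote *kn)
{
	(void)kn;
}

static int
demo_event(struct demo_knote *kn, long hint)
{
	(void)kn;
	return hint != 0;
}

/*
 * Members are named explicitly, so their order in the struct can change
 * without silently mismatching the initializer.  Members that are not
 * named (attach) are zero-initialized.  const allows the toolchain to
 * place the table in read-only data.
 */
static const struct demo_filterops demo_filtops = {
	.flags	= 1,
	.detach	= demo_detach,
	.event	= demo_event,
};
```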


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking the TCP input path. Note that the
functions asserting for the socket lock are not necessarily MP-safe.
Not all the fields of 'struct socket' are protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
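
The multiple-evaluation hazard can be shown with two hypothetical macros. They are not the real EV_SET() and serve only to contrast a naive expansion with one that copies its first argument.

```c
struct demo_event {
	int ident;
	int filter;
};

/* Naive form: "evp" is expanded, and therefore evaluated, twice. */
#define DEMO_SET_NAIVE(evp, id, flt) do {		\
	(evp)->ident = (id);				\
	(evp)->filter = (flt);				\
} while (0)

/* Fixed form: the argument is evaluated once and copied. */
#define DEMO_SET_ONCE(evp, id, flt) do {		\
	struct demo_event *demo_p_ = (evp);		\
	demo_p_->ident = (id);				\
	demo_p_->filter = (flt);			\
} while (0)

/* One use of the naive macro advances the pointer by two slots. */
static int
demo_naive_advance(void)
{
	struct demo_event evs[2] = { { 0, 0 }, { 0, 0 } };
	struct demo_event *p = evs;

	DEMO_SET_NAIVE(p++, 5, 1);
	return (int)(p - evs);
}

/* Fill n consecutive events, advancing the pointer in the macro argument. */
static int
demo_fill_all(struct demo_event *evs, int n)
{
	struct demo_event *p = evs;
	int i;

	for (i = 0; i < n; i++)
		DEMO_SET_ONCE(p++, i, 1);
	return (int)(p - evs);	/* number of slots the pointer advanced */
}
```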


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to log in: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@
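
The kind of "strange expansion" that parentheses prevent can be seen in a generic example. The macros below are hypothetical and unrelated to the actual KNOTE definition.

```c
/* Unparenthesized: the argument is pasted into "x * 2" as-is. */
#define DEMO_DOUBLE_BAD(x)	(x * 2)

/* Parenthesized: the argument is treated as one operand. */
#define DEMO_DOUBLE(x)		((x) * 2)

static int
demo_double_bad(int a, int b)
{
	return DEMO_DOUBLE_BAD(a + b);	/* expands to (a + b * 2) */
}

static int
demo_double(int a, int b)
{
	return DEMO_DOUBLE(a + b);	/* expands to ((a + b) * 2) */
}
```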


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.
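
A small sketch of why the wrapper matters, using a hypothetical DEMO_NOTE macro and counter rather than the real KNOTE:

```c
/* Hypothetical counter bumped by the macro. */
static int demo_hits;

/*
 * A bare "if" in a macro would capture a following "else" from the
 * caller.  Wrapping it in do { } while (0) turns the expansion into a
 * single statement that still takes the caller's semicolon.
 */
#define DEMO_NOTE(list, hint) do {			\
	if ((list) != 0)				\
		demo_hits += (hint);			\
} while (0)

static int
demo_classify(int enabled, int list)
{
	if (enabled)
		DEMO_NOTE(list, 1);
	else
		return -1;	/* binds to "if (enabled)", as intended */
	return demo_hits;
}
```

Without the wrapper, the "else" above would pair with the macro's inner "if", and the function would return -1 whenever the list argument was zero.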

"yeah" deraadt@


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.55 02-Jun-2021 visa

Enable pool cache on knote pool

Use the pool cache to reduce the overhead of memory management in
function kqueue_register().

When EV_ADD is given, kqueue_register() pre-allocates a knote to avoid
potential sleeping in the middle of the critical section that spans
from knote lookup to insertion. However, the pre-allocation is useless
if the lookup finds a matching knote.

The cost of knote allocation will become significant with kqueue-based
poll(2) and select(2) because the frequency of allocation will increase.
Most of the cost appears to come from the locking inside the pool.
The pool cache amortizes it by using CPU-local caches of free knotes
as buffers.

OK dlg@ mpi@


Revision tags: OPENBSD_6_9_BASE
# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@


# 1.54 24-Feb-2021 visa

kqueue: Revise filterops interface

Extend kqueue's filterops interface with new callbacks so that it
becomes easier to use with fine-grained locking. The new interface
delegates the serialization of kn_event access to event sources. Now
kqueue uses filterops callbacks to read or write kn_event. This hides
event sources' locking patterns from kqueue, and allows clean
implementation of atomic read-and-clear for EV_CLEAR, for instance.

There are so many existing filterops instances that converting all of
them in one go is tricky. This patch adds a wrapper mechanism that
kqueue uses when the new callbacks are missing.

The new filterops interface has been influenced by XNU's kqueue.

OK mpi@ semarie@


# 1.53 17-Jan-2021 visa

kqueue: Revise fd close notification

Deliver file descriptor close notification for __EV_POLL knotes through
struct kevent that kqueue_scan() returns. This replaces the previous way
of returning EBADF from kqueue_scan(), making it easier to determine
what exactly has changed.

When a file descriptor is closed, its __EV_POLL knotes are turned into
one-shot events and queued for delivery. These knotes are "unregistered"
as they are reachable only through the queue of active events. This
reduces interference with the normal workings of kqueue. However, more
care is needed to avoid leaking knotes. In addition, the unregistering
removes a limit on the number of issued knotes. To prevent accumulation
of pending fd close notifications, kqpoll_init() flushes the active
queue at the start of a kqpoll scan.

OK mpi@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@
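
A minimal sketch of associating lock operations with a klist, assuming a toy lock in place of a real mutex or rwlock. All demo_ names are hypothetical and the layout is not claimed to match struct klist or struct klistops. It also shows the split between an insert function that takes the lock itself and a _locked variant that only asserts it.

```c
#include <assert.h>
#include <stddef.h>

struct demo_klistops {
	void	(*assertlk)(void *);
	int	(*lock)(void *);
	void	(*unlock)(void *, int);
};

struct demo_klist {
	const struct demo_klistops	*ops;
	void				*arg;	/* passed to every callback */
	int				 nknotes;
};

/* Toy lock owned by the monitored object. */
struct demo_lock {
	int	held;
	int	acquisitions;
};

static void
demo_lock_assert(void *arg)
{
	assert(((struct demo_lock *)arg)->held);
}

static int
demo_lock_enter(void *arg)
{
	struct demo_lock *l = arg;

	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
	return 0;	/* saved state, handed back to unlock */
}

static void
demo_lock_leave(void *arg, int state)
{
	struct demo_lock *l = arg;

	(void)state;
	l->held = 0;
}

const struct demo_klistops demo_lock_klistops = {
	.assertlk	= demo_lock_assert,
	.lock		= demo_lock_enter,
	.unlock		= demo_lock_leave,
};

void
demo_klist_init(struct demo_klist *kl, const struct demo_klistops *ops,
    void *arg)
{
	kl->ops = ops;
	kl->arg = arg;
	kl->nknotes = 0;
}

/* Caller must already hold the lock; the ops let the list verify that. */
void
demo_klist_insert_locked(struct demo_klist *kl)
{
	kl->ops->assertlk(kl->arg);
	kl->nknotes++;
}

/* Convenience variant that operates the lock internally. */
void
demo_klist_insert(struct demo_klist *kl)
{
	int s;

	s = kl->ops->lock(kl->arg);
	demo_klist_insert_locked(kl);
	kl->ops->unlock(kl->arg, s);
}
```

Because the list only holds a pointer to the operations and an argument, the monitored object keeps using whatever lock already protects its state.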


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@
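
The role of the persistent end marker can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the queue is a fixed-size ring of integers rather than a list of knotes, every collected event is immediately reactivated, and all names are hypothetical. The marker is queued once, on the first call, and stays in place across calls, so reactivated events land behind it and are not collected twice.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_QCAP	16
#define DEMO_MARKER	(-1)

struct demo_queue {
	int	items[DEMO_QCAP];
	size_t	head;
	size_t	len;
};

static void
demo_push(struct demo_queue *q, int v)
{
	assert(q->len < DEMO_QCAP);
	q->items[(q->head + q->len) % DEMO_QCAP] = v;
	q->len++;
}

static int
demo_pop(struct demo_queue *q)
{
	int v;

	assert(q->len > 0);
	v = q->items[q->head];
	q->head = (q->head + 1) % DEMO_QCAP;
	q->len--;
	return v;
}

struct demo_scan_state {
	struct demo_queue	*q;
	int			 marker_queued;
	int			 done;
};

void
demo_scan_init(struct demo_scan_state *s, struct demo_queue *q)
{
	s->q = q;
	s->marker_queued = 0;
	s->done = 0;
}

/* Collect at most max events; may be called repeatedly for one scan. */
size_t
demo_scan(struct demo_scan_state *s, int *out, size_t max)
{
	size_t n = 0;

	if (!s->marker_queued) {
		demo_push(s->q, DEMO_MARKER);	/* end of this scan */
		s->marker_queued = 1;
	}
	while (n < max && !s->done) {
		int ev = demo_pop(s->q);

		if (ev == DEMO_MARKER) {
			s->done = 1;
			break;
		}
		out[n++] = ev;
		demo_push(s->q, ev);	/* reactivated: lands behind the marker */
	}
	return n;
}
```

If the marker were queued afresh on every call instead, the second call would start by collecting the events that the first call had just reactivated.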


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@
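
The macro hygiene discussed here can be shown with a reduced example. It is a sketch with hypothetical DEMO_ names, not the EV_SET() definition from event.h. The naive macro expands its pointer argument once per assignment, so an argument with side effects is evaluated repeatedly. The second macro copies the argument into a local whose name starts with two underscores, which system headers may use because such names are reserved for the implementation and cannot collide with a caller's variables.

```c
#include <assert.h>
#include <stddef.h>

struct demo_event {
	unsigned long	ident;
	short		filter;
};

/* Naive form: (evp) is expanded, and therefore evaluated, twice. */
#define DEMO_SET_NAIVE(evp, a, b) do {		\
	(evp)->ident = (a);			\
	(evp)->filter = (b);			\
} while (0)

/* Hygienic form: the argument is evaluated once into a reserved-name local. */
#define DEMO_SET(evp, a, b) do {		\
	struct demo_event *__evp = (evp);	\
	__evp->ident = (a);			\
	__evp->filter = (b);			\
} while (0)

/* Returns how many elements p++ stepped over with the naive macro. */
int
demo_advance_naive(void)
{
	struct demo_event ev[2] = { { 0, 0 }, { 0, 0 } };
	struct demo_event *p = ev;

	DEMO_SET_NAIVE(p++, 7, 1);
	return (int)(p - ev);
}

/* Same call with the hygienic macro. */
int
demo_advance_safe(void)
{
	struct demo_event ev[2] = { { 0, 0 }, { 0, 0 } };
	struct demo_event *p = ev;

	DEMO_SET(p++, 7, 1);
	return (int)(p - ev);
}

/* A caller variable named like the macro parameter is harmless. */
short
demo_shadow_safe(void)
{
	struct demo_event ev = { 0, 0 };
	struct demo_event *evp = &ev;

	DEMO_SET(evp, 9, 3);
	assert(ev.ident == 9);
	return ev.filter;
}
```

Had the macro's local been given an ordinary name, a caller passing a variable of that same name would initialize the local from itself.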


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@
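
A short sketch of the initializer style described above, using a hypothetical demo_filterops layout and flag value rather than the real struct filterops. Designated initializers name each field, so entries survive reordering or the addition of fields, and declaring the object const allows the toolchain to place it in read-only data.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FILTEROP_ISFD	0x01	/* hypothetical flag value */

struct demo_filterops {
	int	f_flags;
	int	(*f_attach)(int);
	void	(*f_detach)(int);
	int	(*f_event)(int, long);
};

static void
demo_detach(int id)
{
	(void)id;
}

static int
demo_event(int id, long hint)
{
	(void)id;
	return hint != 0;
}

/* Each field is named; unnamed fields would default to zero or NULL. */
const struct demo_filterops demo_read_filtops = {
	.f_flags	= DEMO_FILTEROP_ISFD,
	.f_attach	= NULL,
	.f_detach	= demo_detach,
	.f_event	= demo_event,
};

int
demo_is_fd_filter(const struct demo_filterops *fop)
{
	assert(fop != NULL);
	return (fop->f_flags & DEMO_FILTEROP_ISFD) != 0;
}
```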


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking the TCP input path. Note that the
functions asserting for the socket lock are not necessarily all MP-safe.
Not all the fields of 'struct socket' are protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to log in: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@
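
Why the do { } while (0) wrapper matters can be seen in a reduced example with hypothetical DEMO_ names, not the KNOTE() macro itself. A macro that expands to a bare if statement captures a following else, so the else branch of the caller attaches to the macro's own test. Compilers typically warn about the ambiguous else in the first function, which is the point of the demonstration.

```c
#include <assert.h>

static int demo_notified;

static void
demo_notify(int list, int hint)
{
	assert(list != 0);
	(void)hint;
	demo_notified++;
}

/* Bare if: a caller's else binds to this if, not to the caller's. */
#define DEMO_NOTE_BAD(list, hint)				\
	if ((list) != 0) demo_notify((list), (hint))

/* Wrapped: expands to a single statement that needs its own semicolon. */
#define DEMO_NOTE(list, hint) do {				\
	if ((list) != 0)					\
		demo_notify((list), (hint));			\
} while (0)

/* Returns 1 when the else branch ran. */
int
demo_took_else_bad(int cond, int list)
{
	int took_else = 0;

	if (cond)
		DEMO_NOTE_BAD(list, 1);
	else
		took_else = 1;
	return took_else;
}

int
demo_took_else_good(int cond, int list)
{
	int took_else = 0;

	if (cond)
		DEMO_NOTE(list, 1);
	else
		took_else = 1;
	return took_else;
}
```

With the bare form, the else branch runs when cond is true and list is zero, and is skipped when cond is false; the wrapped form behaves as the indentation suggests.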


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.52 25-Dec-2020 visa

Refactor klist insertion and removal

Rename klist_{insert,remove}() to klist_{insert,remove}_locked().
These functions assume that the caller has locked the klist. The current
state of locking remains intact because the kernel lock is still used
with all klists.

Add new functions klist_insert() and klist_remove() that lock the klist
internally. This allows some code simplification.

OK mpi@


# 1.51 20-Dec-2020 visa

Introduce klistops

This patch extends struct klist with a callback descriptor and
an argument. The main purpose of this is to let the kqueue subsystem
assert when a klist should be locked, and operate the klist lock
in klist_invalidate().

Access to a knote list of a kqueue-monitored object has to be
serialized somehow. Because the object often has a lock for protecting
its state, and because the object often acquires this lock at the latest
in its f_event callback function, it makes sense to use this lock also
for the knote lists. The existing uses of NOTE_SUBMIT already show
a pattern that is likely to become more prevalent.

There could be an embedded lock in klist. However, such a lock would be
redundant in many cases. The code cannot rely on a single lock type
(mutex, rwlock, something else) because the needs of monitored objects
vary. In addition, an embedded lock would introduce new lock order
constraints. Note that the patch does not rule out use of dedicated
klist locks.

The patch introduces a way to associate lock operations with a klist.
The caller can provide a custom implementation, or use a ready-made
interface with a mutex or rwlock.

For compatibility with old code, the new code falls back to using the
kernel lock if no specific klist initialization has been done. The
existing code already relies on implicit initialization of klist.

Sadly, this change increases the size of struct klist. dlg@ thinks this
is not fatal, though.

OK mpi@
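
The idea of attaching lock operations to a list head can be sketched in portable C. This is only an illustration under stated assumptions: struct listops, struct list_head, the toy_mutex type and list_insert() are hypothetical names and do not reproduce the real struct klist or its operations. A NULL descriptor stands in for the fallback to the kernel lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback descriptor: how to lock and unlock a list. */
struct listops {
	void	(*lo_lock)(void *);
	void	(*lo_unlock)(void *);
};

/* Hypothetical list head extended with the descriptor and its argument. */
struct list_head {
	const struct listops	*lh_ops;	/* NULL: use the global fallback */
	void			*lh_arg;	/* owner's lock, passed to the ops */
	int			 lh_count;	/* number of inserted entries */
};

/* Counts how often the fallback (standing in for the kernel lock) ran. */
int global_lock_taken;

/* Toy lock that records whether it is held and how often it was taken. */
struct toy_mutex {
	int	held;
	int	taken;
};

static void
toy_lock(void *arg)
{
	struct toy_mutex *m = arg;

	assert(!m->held);
	m->held = 1;
	m->taken++;
}

static void
toy_unlock(void *arg)
{
	struct toy_mutex *m = arg;

	assert(m->held);
	m->held = 0;
}

/* Ready-made descriptor for objects protected by a toy_mutex. */
const struct listops toy_mutex_ops = {
	.lo_lock	= toy_lock,
	.lo_unlock	= toy_unlock,
};

/* Insert under whichever lock the list head was initialized with. */
void
list_insert(struct list_head *lh)
{
	if (lh->lh_ops != NULL) {
		lh->lh_ops->lo_lock(lh->lh_arg);
		lh->lh_count++;
		lh->lh_ops->lo_unlock(lh->lh_arg);
	} else {
		global_lock_taken++;	/* no ops set: global fallback */
		lh->lh_count++;
	}
}
```

A list head set up with &toy_mutex_ops and a pointer to the owner's toy_mutex is serialized by that lock; a head with no descriptor takes the fallback path. No lock is embedded in the list head itself, so the monitored object chooses the lock type.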


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@
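
The role of the preserved end marker can be shown with a small userland simulation. This is a sketch under stated assumptions, not the kernel code: struct evqueue, struct scan_state, MARKER, enqueue() and scan() are hypothetical, and an int array replaces the kernel's queue of knotes.

```c
#include <assert.h>

#define QCAP	16
#define MARKER	(-1)	/* hypothetical sentinel standing in for the end marker */

/* Hypothetical FIFO of pending event ids; not the kernel's data structure. */
struct evqueue {
	int	items[QCAP];
	int	head, tail;	/* items[head..tail) are pending */
};

/* Scan context: remembers whether the end marker is already queued. */
struct scan_state {
	int	marker_queued;
};

void
enqueue(struct evqueue *q, int id)
{
	assert(q->tail < QCAP);
	q->items[q->tail++] = id;
}

/*
 * Collect up to max events into out[]. The marker is inserted once, on the
 * first call, and kept across calls; events activated later sit behind it.
 * Each collected event is reactivated to model a source that stays ready.
 * Returns the number collected; 0 means the marker was reached.
 */
int
scan(struct evqueue *q, struct scan_state *st, int *out, int max)
{
	int n = 0;

	if (!st->marker_queued) {
		enqueue(q, MARKER);
		st->marker_queued = 1;
	}
	while (n < max && q->items[q->head] != MARKER) {
		int id = q->items[q->head++];

		out[n++] = id;
		enqueue(q, id);		/* reactivated: lands behind the marker */
	}
	return n;
}
```

With three pending events and max set to 2, the first call collects two events, the second collects the remaining one, and the third returns 0. If each call queued a fresh marker instead, the reactivated events would sit in front of it and be collected a second time.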


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@
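
A minimal sketch of the pattern, assuming a simplified and hypothetical struct ops rather than the real struct filterops; demo_attach(), demo_event() and demo_ops are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified operations table; not the real struct filterops. */
struct ops {
	int	flags;
	int	(*attach)(int);
	void	(*detach)(int);
	int	(*event)(int, long);
};

static int
demo_attach(int id)
{
	return id >= 0 ? 0 : -1;
}

static int
demo_event(int id, long hint)
{
	(void)id;
	return hint != 0;
}

/*
 * Designated initializers name each member, so the table stays correct if
 * members are reordered, and members left out (detach here) are zeroed.
 * Declaring the table const lets the toolchain place it in read-only data.
 */
const struct ops demo_ops = {
	.flags	= 1,
	.attach	= demo_attach,
	.event	= demo_event,
};
```

A positional initializer would need a placeholder for every skipped member and would silently break if the member order changed.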


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking the TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
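
The multiple-evaluation problem can be reproduced with a reduced example. The struct ev type and the SET_EV and SET_EV_NAIVE macros below are hypothetical stand-ins, not the definitions from event.h; they only illustrate why copying the first argument matters.

```c
#include <assert.h>

/* Hypothetical stand-in for struct kevent, reduced to two fields. */
struct ev {
	long	ident;
	int	filter;
};

/*
 * Naive version: the pointer argument is expanded twice, so an argument
 * with side effects (such as p++) is evaluated twice.
 */
#define SET_EV_NAIVE(evp, a, b) do {		\
	(evp)->ident = (a);			\
	(evp)->filter = (b);			\
} while (0)

/* Copying the argument into a local first evaluates it exactly once. */
#define SET_EV(evp, a, b) do {			\
	struct ev *evp_ = (evp);		\
	evp_->ident = (a);			\
	evp_->filter = (b);			\
} while (0)

/* Fill n entries with the post-increment idiom; return entries advanced. */
long
fill(struct ev *evs, int n)
{
	struct ev *p = evs;
	int i;

	for (i = 0; i < n; i++)
		SET_EV(p++, i, -1);
	return (long)(p - evs);
}

/* Same loop with the naive macro: p advances twice per iteration. */
long
fill_naive(struct ev *evs, int n)
{
	struct ev *p = evs;
	int i;

	for (i = 0; i < n; i++)
		SET_EV_NAIVE(p++, i, -1);
	return (long)(p - evs);
}
```

With n set to 2, fill() advances by two entries and fills both completely, while fill_naive() advances by four and leaves each entry half initialized, so its caller must supply at least four entries.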


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@

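Why the r1.12 wrapper matters can be seen in a short sketch. NOTIFY_IF_SET, notify_count and dispatch() are hypothetical names chosen for illustration; this is not the KNOTE macro itself.

```c
#include <assert.h>

/* Hypothetical counter standing in for "a notification was delivered". */
int notify_count;

/*
 * Hypothetical macro with an embedded if statement. The do { } while (0)
 * wrapper turns the expansion into a single statement, so a following
 * "else" binds to the caller's "if" and not to the "if" inside the macro.
 */
#define NOTIFY_IF_SET(flag) do {		\
	if ((flag) != 0)			\
		notify_count++;			\
} while (0)

/* Returns 1 if the macro branch was taken, 2 if the else branch was. */
int
dispatch(int enabled, int flag)
{
	if (enabled)
		NOTIFY_IF_SET(flag);
	else
		return 2;
	return 1;
}
```

If the macro expanded to a bare if statement, the else in dispatch() would attach to the macro's if, and dispatch(1, 0) would return 2 instead of 1.
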

Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.50 18-Dec-2020 visa

Make knote_{activate,remove}() internal to kern_event.c.

OK mpi@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and Dragonfly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionnality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to log in: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.49 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.48 07-Dec-2020 mpi

Refactor kqueue_scan() so it can be used by other syscalls.

Stop iterating in the function and instead copy the returned events to
userland after every call.

ok visa@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and Dragonfly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionnality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enable the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarilly MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


# 1.47 25-Nov-2020 mpi

Change kqueue_scan() to keep track of collected events in the given context.

It is now possible to call the function multiple times to collect events.
For that, the end marker has to be preserved between calls because otherwise
the scan might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

This is required to implement select(2) and poll(2) on top of kqueue_scan().

Done & originally committed by visa@ in r1.143, in snap for more than 2 weeks.

ok visa@, anton@


# 1.46 11-Oct-2020 mpi

Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct".

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion.

Extracted from a previous diff from visa@.

ok visa@, anton@


Revision tags: OPENBSD_6_8_BASE
# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.45 23-Aug-2020 mpi

Allow userland to use EVFILT_EXCEPT.

ok mvs@, visa@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and Dragonfly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionnality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
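
The multiple-evaluation problem fixed in 1.24 can be sketched as follows. The macros and types here are hypothetical and are not the EV_SET() or KNOTE() definitions; next_slot() is an invented helper with a deliberate side effect. The sketch shows that a macro which mentions its first argument twice evaluates it twice, and that copying the argument into a local evaluates it once.

```c
#include <assert.h>

struct demo_event {
	int ident;
	int filter;
};

static struct demo_event slots[4];
static int next_calls;

/* Returns the next free slot; every call advances the counter. */
static struct demo_event *
next_slot(void)
{
	assert(next_calls < 4);
	return &slots[next_calls++];
}

/* Mentions `evp` twice, so the argument expression runs twice. */
#define DEMO_SET_MULTI(evp, a, b) do {		\
	(evp)->ident = (a);			\
	(evp)->filter = (b);			\
} while (0)

/* Copies `evp` first, so the argument expression runs once. */
#define DEMO_SET_ONCE(evp, a, b) do {		\
	struct demo_event *__evp = (evp);	\
	__evp->ident = (a);			\
	__evp->filter = (b);			\
} while (0)

static int
calls_with_multi(void)
{
	next_calls = 0;
	DEMO_SET_MULTI(next_slot(), 7, 9);
	return next_calls;
}

static int
calls_with_once(void)
{
	next_calls = 0;
	DEMO_SET_ONCE(next_slot(), 7, 9);
	return next_calls;
}
```

In this sketch calls_with_multi() returns 2 and splits the two fields across two different slots, while calls_with_once() returns 1 and fills a single slot.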


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@
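
The do { } while (0) idiom mentioned in 1.12 can be sketched like this. DEMO_NOTIFY() and the surrounding names are hypothetical and do not reproduce the KNOTE() macro. The comment shows what would go wrong with a bare if in the macro body; the code itself uses only the wrapped form.

```c
#include <assert.h>

static int notified;

/*
 * If this macro were defined as a bare
 *     if ((pending) > 0) notified += (pending)
 * then in
 *     if (enabled) DEMO_NOTIFY(n); else fallback = 1;
 * the `else` would attach to the `if` hidden inside the macro.
 * Wrapping the body makes the expansion a single statement.
 */
#define DEMO_NOTIFY(pending) do {		\
	if ((pending) > 0)			\
		notified += (pending);		\
} while (0)

/* Returns 1 when the else branch was taken, 0 otherwise. */
static int
fallback_taken(int enabled, int pending)
{
	int fallback = 0;

	assert(pending >= 0);
	if (enabled)
		DEMO_NOTIFY(pending);
	else
		fallback = 1;
	return fallback;
}
```

With the wrapped macro, fallback_taken(0, 3) returns 1 and fallback_taken(1, 0) returns 0, which is what the indentation suggests.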


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.44 22-Jun-2020 mpi

Extend kqueue interface with EVFILT_EXCEPT filter.

This filter, already implemented in macOS and DragonFly BSD, returns
exceptional conditions like the reception of out-of-band data.

The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and
it can be used by the kqfilter-based poll & select implementation.

ok millert@ on a previous version, ok visa@


# 1.43 15-Jun-2020 mpi

Implement a simple kqfilter for deadfs matching its poll handler.

ok visa@, millert@


# 1.42 15-Jun-2020 mpi

Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found.

This is only done in poll-compatibility mode, when __EV_POLL is set.

ok visa@, millert@


# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@
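
The role of the end marker described in 1.37 can be modelled with a toy queue. This is a simplified, single-threaded sketch and not the kqueue_scan() code; struct item, scan_once() and demo_scan() are hypothetical names, and the TAILQ macros come from <sys/queue.h>. Every collected item is immediately re-queued at the tail, which places it behind the marker, so one scan collects each item exactly once.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct item {
	TAILQ_ENTRY(item) link;
	int hits;	/* how many times this item was collected */
};

TAILQ_HEAD(itemq, item);

/*
 * Collect everything queued ahead of an end marker. Each collected item
 * is "reactivated" by appending it to the tail, i.e. after the marker,
 * where this scan will not reach it.
 */
static int
scan_once(struct itemq *q)
{
	struct item end = { .hits = 0 };	/* marker, carries no event */
	struct item *it;
	int collected = 0;

	TAILQ_INSERT_TAIL(q, &end, link);
	while ((it = TAILQ_FIRST(q)) != &end) {
		TAILQ_REMOVE(q, it, link);
		it->hits++;
		collected++;
		TAILQ_INSERT_TAIL(q, it, link);
	}
	TAILQ_REMOVE(q, &end, link);
	return collected;
}

static int
demo_scan(void)
{
	struct itemq q;
	struct item a = { .hits = 0 }, b = { .hits = 0 }, c = { .hits = 0 };
	int n;

	TAILQ_INIT(&q);
	TAILQ_INSERT_TAIL(&q, &a, link);
	TAILQ_INSERT_TAIL(&q, &b, link);
	TAILQ_INSERT_TAIL(&q, &c, link);

	n = scan_once(&q);
	assert(a.hits == 1 && b.hits == 1 && c.hits == 1);
	return n;
}
```

Without the marker, the loop in this model would never terminate, because every item it removes reappears at the tail. If a scan were split across several calls, the marker would have to stay in the queue between calls, which is the situation the commit message describes.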


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@



# 1.41 12-Jun-2020 mpi

Revert addition of double underbars for filter-specific flag.

Port breakages reported by naddy@


# 1.40 11-Jun-2020 mpi

Rename poll-compatibility flag to better reflect what it is.

While here prefix kernel-only EV flags with two underbars.

Suggested by kettenis@, ok visa@


# 1.39 08-Jun-2020 mpi

Use a new EV_OLDAPI flag to match the behavior of poll(2) and select(2).

Adapt FS kqfilters to always return true when the flag is set and bypass
the polling mechanism of the NFS thread.

While here implement a write filter for NFS.

ok visa@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.38 25-May-2020 visa

Revert "Add kqueue_scan_state struct"

sthen@ has reported that the patch might be causing hangs with X.


# 1.37 17-May-2020 visa

Add kqueue_scan_state struct

The struct keeps track of the end point of an event queue scan by
persisting the end marker. This will be needed when kqueue_scan() is
called repeatedly to complete a scan in a piecewise fashion. The end
marker has to be preserved between calls because otherwise the scan
might collect an event more than once. If a collected event gets
reactivated during scanning, it will be added at the tail of the queue,
out of reach because of the end marker.

OK mpi@


# 1.36 10-May-2020 guenther

Use a double-underscore prefix for local variables declared in macros
that have arguments. Document this requirement/recommendation in style(9)

prompted by mpi@
ok deraadt@


Revision tags: OPENBSD_6_7_BASE
# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@
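
As an illustration of the pattern this entry describes, the sketch below
declares a const operations table with C99 designated initializers. The
struct layout and every name in it (struct demo_ops, demo_filtops,
demo_dispatch() and the callbacks) are hypothetical and do not
reproduce struct filterops. Naming each member makes the initializer
independent of member order, and a const object with static storage is
typically placed in a read-only section by the toolchain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table; not the layout of struct filterops. */
struct demo_ops {
	int	flags;
	int	(*attach)(int);
	void	(*detach)(int);
	int	(*event)(int, long);
};

static int
demo_attach(int id)
{
	return id >= 0 ? 0 : -1;
}

static int
demo_event(int id, long hint)
{
	(void)id;
	return hint != 0;
}

/* Members are named, so adding or reordering fields is harmless. */
static const struct demo_ops demo_filtops = {
	.flags	= 1,
	.attach	= demo_attach,
	.detach	= NULL,
	.event	= demo_event,
};

/* Call the event hook through a pointer to the const table. */
static int
demo_dispatch(const struct demo_ops *ops, int id, long hint)
{
	assert(ops != NULL && ops->event != NULL);
	return ops->event(id, hint);
}
```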


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@
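
The two macro fixes in revisions 1.12 and 1.13 can be sketched together.
BUMP_IF_SET(), struct counter, bump() and classify() below are
hypothetical names, not the KNOTE macro itself. Without the
do { } while (0) wrapper, the macro would expand to a bare if statement,
and the else in classify() would bind to that inner if rather than to
"if (ready)". The parentheses around each argument keep an expression
such as 1 + 1 intact after expansion.

```c
#include <assert.h>
#include <stddef.h>

struct counter {
	int hits;
};

static void
bump(struct counter *c, int by)
{
	assert(c != NULL);
	c->hits += by;
}

/*
 * Hypothetical macro that wraps an if statement. do { } while (0)
 * turns the expansion into a single statement, and each use of an
 * argument is parenthesized. Note that (c) is still evaluated twice,
 * so callers should pass an expression without side effects.
 */
#define BUMP_IF_SET(c, by) do {			\
	if ((c) != NULL)			\
		bump((c), (by));		\
} while (0)

/* Safe to use inside if/else thanks to the wrapper. */
static int
classify(struct counter *c, int ready)
{
	if (ready)
		BUMP_IF_SET(c, 1 + 1);
	else
		return -1;
	return 0;
}
```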


Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.35 07-Apr-2020 visa

Abstract the head of knote lists. This allows extending the lists,
for example, with locking assertions.

OK mpi@, anton@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enable the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt
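
The multiple-evaluation problem that this commit addresses can be sketched in portable C. The code below is only an illustration: `struct rec`, `REC_SET_UNSAFE`, `REC_SET` and `fill_records` are hypothetical stand-ins, not the definitions in event.h. The idea shown is that a macro which copies its pointer argument into a local evaluates that argument once, so a side effect such as `cur++` happens a single time.

```c
#include <stddef.h>

/* Hypothetical record type standing in for a kevent-like struct. */
struct rec {
	int id;
	int flags;
};

/*
 * Unsafe variant, shown for contrast and not used below: the pointer
 * argument is expanded twice, so REC_SET_UNSAFE(cur++, ...) would
 * advance the cursor twice and write to two different records.
 */
#define REC_SET_UNSAFE(p, a, b) do {	\
	(p)->id = (a);			\
	(p)->flags = (b);		\
} while (0)

/* Safer variant: evaluate the pointer argument once and reuse the copy. */
#define REC_SET(p, a, b) do {		\
	struct rec *rec_set_p_ = (p);	\
	rec_set_p_->id = (a);		\
	rec_set_p_->flags = (b);	\
} while (0)

/*
 * Fill n records through a post-incremented cursor and return how far
 * the cursor advanced.
 */
static size_t
fill_records(struct rec *buf, size_t n)
{
	struct rec *cur = buf;
	size_t i;

	for (i = 0; i < n; i++)
		REC_SET(cur++, (int)i, 1);
	return (size_t)(cur - buf);
}
```

With `REC_SET`, the cursor advances exactly once per iteration; swapping in `REC_SET_UNSAFE` would skip every other record and run past the end of the buffer.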


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.17 08-Jun-2012 guenther

Missed a comment in the proc->process change


# 1.16 06-Jun-2012 guenther

EVFILT_SIGNAL and EVFILT_PROC events need to track the process they're
attached to and not just the thread, which can go away.

Problem observed by jsg@; ok jsg@ matthew@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.15 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.14 28-Jul-2010 nicm

Add a dummy kqueue filter similar to seltrue and use it for anything
using seltrue for poll. Based on code from NetBSD.

Also remove a stray duplicate lpt entry from loongson, from deraadt.

ok tedu deraadt


Revision tags: OPENBSD_4_5_BASE OPENBSD_4_6_BASE OPENBSD_4_7_BASE
# 1.13 05-Nov-2008 dlg

wrap use of KNOTE macro arguments in () to prevent potential strange
expansion.

requested by otto@
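
The "strange expansion" that parenthesizing macro arguments guards against can be shown with a small, self-contained sketch. `SCALE_BAD` and `SCALE` are hypothetical macros invented for this illustration and have nothing to do with the real KNOTE definition; they only demonstrate how operator precedence changes the result when an argument is an expression.

```c
/* Argument not parenthesized: SCALE_BAD(a + b) expands to (a + b * 2). */
#define SCALE_BAD(x)	(x * 2)

/* Argument parenthesized: SCALE(a + b) expands to ((a + b) * 2). */
#define SCALE(x)	((x) * 2)

static int
scale_sum_bad(int a, int b)
{
	return SCALE_BAD(a + b);	/* computes a + (b * 2) */
}

static int
scale_sum(int a, int b)
{
	return SCALE(a + b);		/* computes (a + b) * 2 */
}
```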


# 1.12 05-Nov-2008 dlg

wrap an if statement in a macro up with do { } while (0) so it is safe to
use in other if/else blocks.

"yeah" deraadt@

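As a rough illustration of why the do { } while (0) wrapper matters, the sketch below uses a hypothetical `NOTIFY` macro and a `notified` counter (neither is taken from event.h). With a bare `if` inside the macro, an `else` written after the macro call would attach to the hidden `if` rather than to the caller's; the wrapped form expands to a single statement, so the `else` pairs as the caller intended.

```c
/* Hypothetical counter bumped by the macro. */
static int notified;

/*
 * Bare form, shown for contrast and not used below: in
 *	if (ready) NOTIFY_BARE(list); else ...
 * the else would bind to the if hidden inside the macro.
 */
#define NOTIFY_BARE(list)	if ((list) != 0) notified++

/* Wrapped form: expands to exactly one statement. */
#define NOTIFY(list) do {		\
	if ((list) != 0)		\
		notified++;		\
} while (0)

/* Returns 1 when the else branch ran, 0 otherwise. */
static int
classify(int ready, int list)
{
	int skipped = 0;

	notified = 0;
	if (ready)
		NOTIFY(list);
	else
		skipped = 1;	/* pairs with if (ready), as intended */
	return skipped;
}
```
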

Revision tags: OPENBSD_4_2_BASE OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.11 30-May-2007 tedu

add a new kevent filter type for timers. this allows processes to create
a series of oneshot or periodic timers. capped to a global limit.
from freebsd via brad.
ok art pedro


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.10 19-Dec-2005 millert

Change sys/select.h -> sys/selinfo.h in comment.


Revision tags: OPENBSD_3_5_BASE OPENBSD_3_6_BASE OPENBSD_3_7_BASE OPENBSD_3_8_BASE SMP_SYNC_A SMP_SYNC_B
# 1.9 12-Jan-2004 tedu

klist_invalidate to help clean up when the backend disappears, tested by mpf@


# 1.8 17-Dec-2003 tedu

add NOTE_EOF (return on EOF) and NOTE_TRUNCATE (vnode was truncated)
to kqueue
from marius@monkey tested by brad@


Revision tags: OPENBSD_3_4_BASE
# 1.7 22-Jul-2003 tedu

void *, not caddr_t. missed in last commit. thanks Marco Peereboom


# 1.6 27-Jun-2003 nate

filter event that simulates seltrue(). From NetBSD


# 1.5 22-May-2003 nate

filterops doesn't need to change, so we can make it const
ok deraadt@


Revision tags: OPENBSD_3_1_BASE OPENBSD_3_2_BASE OPENBSD_3_3_BASE UBC_SYNC_A UBC_SYNC_B
# 1.4 14-Mar-2002 millert

First round of __P removal in sys


Revision tags: OPENBSD_2_9_BASE OPENBSD_3_0_BASE UBC_BASE
# 1.3 01-Mar-2001 provos

branches: 1.3.4; 1.3.8;
port kqueue changes from freebsd, plus all required openbsd glue.
okay deraadt@, millert@
from jlemon@freebsd.org:
extend kqueue down to the device layer, backwards compatible approach
suggested by peter@freebsd.org


# 1.2 16-Nov-2000 mickey

rcsid; lots of bad tabs and spaces


# 1.1 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


# 1.34 04-Apr-2020 mpi

Prevent shadowing of local variable by the EV_SET() macro.

Use two underbars to start the locally defined variable, as suggested by
guenther@. The other option to avoid namespace conflict would be to start
the identifier with an underbar and a capital.

ok beck@, guenther@
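
A minimal sketch of the shadowing hazard, using hypothetical names (`struct item`, `ITEM_SET_ID`, `set_via_local_p`) rather than the real EV_SET(): if a macro declares a temporary called, say, `p`, and the caller passes its own variable `p`, the expansion becomes `struct item *p = (p);`, which initializes the temporary from itself. A system header can avoid the clash with a reserved double-underscore name, as the commit describes. Ordinary application code should not use reserved identifiers, so this sketch assumes an unlikely trailing-underscore name instead.

```c
struct item {
	int id;
};

/*
 * The temporary uses a name that callers are unlikely to pick, so the
 * argument expression cannot accidentally refer to it.
 */
#define ITEM_SET_ID(ptr, v) do {		\
	struct item *item_set_tmp_ = (ptr);	\
	item_set_tmp_->id = (v);		\
} while (0)

static int
set_via_local_p(int value)
{
	struct item it = { 0 };
	struct item *p = &it;	/* a name a naive macro might also use */

	ITEM_SET_ID(p, value);
	return it.id;
}
```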


# 1.33 20-Feb-2020 visa

Replace field f_isfd with field f_flags in struct filterops to allow
adding more filter properties without cluttering the struct.

OK mpi@, anton@


# 1.32 31-Dec-2019 visa

Use C99 designated initializers with struct filterops. In addition,
make the structs const so that the data are put in .rodata.

OK mpi@, deraadt@, anton@, bluhm@
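
The pattern this commit describes can be sketched as follows. The `struct ops` layout and the `demo_*` functions are hypothetical and do not reproduce the real struct filterops; the sketch only shows that C99 designated initializers name each field, leave omitted members zero-initialized, and combine naturally with `const` so that the table can be placed in read-only data.

```c
#include <stddef.h>

/* Hypothetical ops table; the field names are illustrative only. */
struct ops {
	int	flags;
	int	(*attach)(int);
	void	(*detach)(int);
	int	(*event)(int, long);
};

static int
demo_attach(int id)
{
	return id >= 0 ? 0 : -1;
}

static int
demo_event(int id, long hint)
{
	(void)id;
	return hint != 0;
}

/*
 * Fields are named, so their order does not matter, and the omitted
 * detach member is initialized to a null pointer.
 */
static const struct ops demo_ops = {
	.flags	= 1,
	.attach	= demo_attach,
	.event	= demo_event,
};
```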


# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@



# 1.31 12-Dec-2019 visa

Allow sleeping inside kqueue event filters.

In kqueue_scan(), threads have to get an exclusive access to a knote
before processing by calling knote_acquire(). This prevents the knote
from being destroyed while it is still in use. knote_acquire() also
blocks other threads from processing the knote. Once knote processing
has finished, the thread has to call knote_release().

The kqueue subsystem is still serialized by the kernel lock. If an event
filter sleeps, the kernel lock is released and another thread might
enter kqueue_scan(). kqueue_scan() uses start and end markers to keep
track of the scan's progress and it has to be aware of other threads'
markers.

This patch is a revised version of mpi@'s work derived
from DragonFly BSD. kqueue_check() has been adapted from NetBSD.

Tested by anton@, sashan@
OK mpi@, anton@, sashan@


Revision tags: OPENBSD_6_3_BASE OPENBSD_6_4_BASE OPENBSD_6_5_BASE OPENBSD_6_6_BASE
# 1.30 13-Jan-2018 robert

introduce a filter called EVFILT_DEVICE that can be used to notify
listeners of device state changes.
currently only supports NOTE_CHANGE that will be used by drm(4)

ok kettenis@


# 1.29 21-Dec-2017 millert

Expand u_short and u_int to unsigned short and unsigned int
respectively to avoid compilation errors when one of the POSIX or
X/OPEN version macros is defined. Also sync the field descriptions
with kqueue.2. OK deraadt@


# 1.28 18-Dec-2017 mpi

Revert support for multiple threads to enter kqueue_scan() in parallel.

It is not clear if this change is responsible for the lockups experienced
by dhill@ and jcs@ but since we're no longer grabbing the socket lock in
kqueue(2) filters there's no need for this change.


# 1.27 04-Nov-2017 mpi

Make it possible for multiple threads to enter kqueue_scan() in parallel.

This is a requirement to use a sleeping lock inside kqueue filters.
It is now possible, but not recommended, to sleep inside ``f_event''.

Threads iterating over the list of pending events are now recognizing
and skipping other threads' markers. knote_acquire() and knote_release()
must be used to "own" a knote to make sure no other thread is sleeping
with a reference on it.

Acquire and marker logic taken from DragonFly but the KERNEL_LOCK()
is still serializing the execution of the kqueue code.

This also enables the NET_LOCK() in socket filters.

Tested by abieber@ & juanfra@, run by naddy@ in a bulk, ok visa@, bluhm@


Revision tags: OPENBSD_6_2_BASE
# 1.26 26-Jun-2017 mpi

Assert that the corresponding socket is locked when manipulating socket
buffers.

This is one step towards unlocking TCP input path. Note that all the
functions asserting for the socket lock are not necessarily MP-safe.
All the fields of 'struct socket' aren't protected.

Introduce a new kernel-only kqueue hint, NOTE_SUBMIT, to be able to
tell when a filter needs to lock the underlying data structures. Logic
and name taken from NetBSD.

Tested by Hrvoje Popovski.

ok claudio@, bluhm@, mikeb@


# 1.25 31-May-2017 mikeb

Add support for EV_RECEIPT and EV_DISPATCH flags

From FreeBSD via Jan Schreiber <jes at posteo ! de>, thanks!
OK tedu, bluhm


# 1.24 31-May-2017 tedu

make a copy of the first EV_SET argument to prevent multiple evaluation.
matches freebsd, fixes lldb. from Kamil Rytarowski at NetBSD.
while here, make the same change to KNOTE. ok deraadt


Revision tags: OPENBSD_6_1_BASE
# 1.23 24-Sep-2016 tedu

move knhash size to event.h, use it for hashfree. from Mathieu -
ok guenther


# 1.22 13-Aug-2016 tedu

modern interfaces should use modern speelings, so spell quad_t as int64_t.


Revision tags: OPENBSD_5_9_BASE OPENBSD_6_0_BASE
# 1.21 06-Oct-2015 guenther

struct knote's kn_sdata needs to be the same type as struct kevent's data

ok deraadt@


Revision tags: OPENBSD_5_6_BASE OPENBSD_5_7_BASE OPENBSD_5_8_BASE
# 1.20 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


Revision tags: OPENBSD_5_5_BASE
# 1.19 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.18 24-Apr-2013 nicm

When a ucom(4) is removed, it frees the tty with ttyfree(). However if
anyone is waiting with kqueue their knotes may still have a reference to
the tty and later try to use it in the filt_tty* functions.

To avoid this, walk the knotes in ttyfree(), remove them from the tty's
list and invalidate them by setting kn_hook to NODEV. The filter
functions can then check for this and safely ignore the knotes.

ok tedu matthieu

