History log of /freebsd-current/sys/kern/kern_event.c
Revision Date Author Comments
# f28526e9 19-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

kcmp(2): implement for generic file types

Reviewed by: brooks, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43518


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 67f938c5 02-Jun-2023 Mark Johnston <markj@FreeBSD.org>

kevent: Make references to filter definitions const

Follow-up revisions can make individual filter definitions const. No
functional change intended.

Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35842


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# dac31024 31-Mar-2023 Konstantin Belousov <kib@FreeBSD.org>

Rename kqueue1(2) to kqueuex(2) to avoid compat issues with NetBSD

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39377


# 61194e98 25-Mar-2023 Konstantin Belousov <kib@FreeBSD.org>

Add kqueue1() syscall

It takes the flags argument. Immediate use is to provide the KQUEUE_CLOEXEC
flag for kqueue(2).

Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39271


# 3454a7ca 20-Aug-2022 Robert Wing <rew@FreeBSD.org>

kqueue: retire knlist_init_rw_reader()

Last usage was removed in afa85850e79c1839ec33efa1138206687b952cfa.

Reviewed by: pauamma, melifaro, kib
Differential Revision: https://reviews.freebsd.org/D36205


# c6d31b83 18-Jul-2022 Konstantin Belousov <kib@FreeBSD.org>

AST: rework

Make most AST handlers dynamically registered. This allows
subsystem-specific handler source to live in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.

It also allows some handlers to be designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and the NFS server no longer need to be aware
of UFS softdep processing.

The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code makes no effort to ensure that the module is not unloaded
before all threads have been processed through its AST handlers. In
fact, this is already the behavior for hwpmc.ko and ufs.ko. I do not
think it is worth the effort and the runtime overhead to try to fix it.

Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
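
The registration scheme described above can be pictured with a small
userland model. This is only a sketch under stated assumptions: the
table, the MODEL_ASTR_CLEANUP flag and every model_* name are
hypothetical and are not the kernel's AST interface. It shows handlers
being registered by ID, all pending handlers running at AST time, and
only cleanup-type handlers running at exit.

```c
#include <assert.h>
#include <stddef.h>

enum { MODEL_AST_MAX = 8 };		/* hypothetical table size */
#define MODEL_ASTR_CLEANUP 0x1		/* hypothetical: also run at exit */

typedef void (*model_ast_fn)(int id);

struct model_ast_slot {
	model_ast_fn fn;
	int flags;
};

static struct model_ast_slot model_ast_table[MODEL_AST_MAX];
static int model_ast_calls[MODEL_AST_MAX];

/* Example handler: counts how often each handler ID was invoked. */
static void
model_ast_count(int id)
{
	model_ast_calls[id]++;
}

/* A subsystem registers its own handler instead of a central file. */
static void
model_ast_register(int id, int flags, model_ast_fn fn)
{
	assert(id >= 0 && id < MODEL_AST_MAX && fn != NULL);
	model_ast_table[id].fn = fn;
	model_ast_table[id].flags = flags;
}

/*
 * Run the handlers whose bit is set in "pending".  At exit time only
 * handlers registered as cleanup type are run.
 */
static void
model_ast_run(unsigned pending, int at_exit)
{
	for (int id = 0; id < MODEL_AST_MAX; id++) {
		const struct model_ast_slot *s = &model_ast_table[id];

		if ((pending & (1u << id)) == 0 || s->fn == NULL)
			continue;
		if (at_exit && (s->flags & MODEL_ASTR_CLEANUP) == 0)
			continue;
		s->fn(id);
	}
}
```

Calling model_ast_run(mask, 0) models the return-to-userspace path,
while model_ast_run(mask, 1) models thread or process exit.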


# 8c309d48 17-Jun-2022 Damjan Jovanovic <damjan.jov@gmail.com>

struct kinfo_file changes needed for lsof to work using only usermode APIs

Add kf_pipe_buffer_[in/out/size] fields to kf_pipe, and populate them.

Add a kf_kqueue struct to the kf_un union, to allow querying kqueue state,
and populate it.

Populate the kf_sock_rcv_sb_state and kf_sock_snd_sb_state fields in
kf_sock for INET/INET6 sockets, and populate all other fields for all
transport layer protocols, not just TCP.

Bump __FreeBSD_version.

Differential revision: https://reviews.freebsd.org/D34184
Reviewed by: jhb, kib, se
MFC after: 1 week


# 524dadf7 24-May-2022 Mark Johnston <markj@FreeBSD.org>

kevent: Fix an off-by-one in filt_timerexpire_l()

Suppose a periodic kevent timer fires close to its deadline, so that
now - kc->next is small. Then delta ends up being 1, and the next timer
deadline is set to (delta + 1) * kc->to, where kc->to is the timer
period. This means that the timer fires at half of the requested rate,
and the value returned in kn_data is similarly inaccurate.

PR: 264131
Fixes: 7cb40543e964 ("filt_timerexpire: do not iterate over the interval")
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35313
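
The arithmetic described above can be checked with a small userland
model. This is a sketch, not the kernel code: the struct, the time
type and the function names are hypothetical, with "period" playing
the role of kc->to and "next" the role of kc->next. The first function
shows one way to get the intended behaviour, the second reproduces the
pre-fix arithmetic as the message describes it.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's time type. */
typedef int64_t model_time_t;

struct timer_model {
	model_time_t next;	/* next deadline */
	model_time_t period;	/* timer period */
};

/*
 * Count every period boundary that has passed by "now" and move the
 * deadline to the first boundary strictly after "now".  Returns the
 * number of expirations to report.
 */
static int64_t
timer_model_expire(struct timer_model *t, model_time_t now)
{
	int64_t fired;

	assert(t->period > 0);
	if (now < t->next)
		return (0);
	fired = (now - t->next) / t->period + 1;
	t->next += fired * t->period;
	return (fired);
}

/* The pre-fix arithmetic: the deadline advances by (delta + 1) periods. */
static int64_t
timer_model_expire_buggy(struct timer_model *t, model_time_t now)
{
	int64_t delta;

	assert(t->period > 0);
	if (now < t->next)
		return (0);
	delta = (now - t->next) / t->period;
	if (delta == 0)
		delta = 1;
	t->next += (delta + 1) * t->period;
	return (delta);
}
```

With a period of 100 and a timer that fires 3 ticks after its deadline
of 1000, the first function moves the deadline to 1100 while the second
moves it to 1200, which is the half-rate behaviour from the report.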


# 2479e381 19-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

kqueue: Trim trailing whitespace

MFC after: 1 week


# 91e7bdcd 25-Apr-2022 Dmitry Chagin <dchagin@FreeBSD.org>

Add timespecvalid_interval macro and use it.

Reviewed by: jhb, imp (early rev)
Differential revision: https://reviews.freebsd.org/D34848
MFC after: 2 weeks


# c3721292 09-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

kern: Remove a double word in a source code comment

- s/for for/for/

MFC after: 3 days


# 8e4a3add 15-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

struct kevent_freebsd11 -> struct freebsd11_kevent

Rename to match the naming of syscalls and allow 32 to be appended
without making an ugly name like kevent_freebsd1132.

While here, make the kevent changelist argument const.

Reviewed by: kib


# 2b68eb8e 01-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove thread argument from VOP_STAT

and fo_stat.


# 2f4dbe27 01-Oct-2021 Kyle Evans <kevans@FreeBSD.org>

kqueue: fix recent assertion

NOTE_ABSTIME may also have a zero timeout, which indicates that we
should still fire immediately as an absolute time in the past. A test
has been added for this one as well.

Fixes: 9c999a259f00 ("kqueue: don't arbitrarily restrict long-past...")
Point hat: kevans
Reported by: syzbot+1c8d1154f560b3930042@syzkaller.appspotmail.com


# 9c999a25 29-Sep-2021 Kyle Evans <kevans@FreeBSD.org>

kqueue: don't arbitrarily restrict long-past values for NOTE_ABSTIME

NOTE_ABSTIME values are converted to values relative to boottime in
filt_timervalidate(), and negative values are currently rejected. We
don't reject times in the past in general, so clamp this up to 0 as
needed such that the timer fires immediately rather than imposing what
looks like an arbitrary restriction.

Another possible scenario is that the system clock had to be adjusted
by ~minutes or ~hours and we have less than that in terms of uptime,
making a reasonable short-timeout suddenly invalid. Firing it is still
a valid choice in this scenario so that applications can at least
expect a consistent behavior.

Reviewed by: kib, markj
Discussed with: allanjude
Differential Revision: https://reviews.freebsd.org/D32230
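
A minimal sketch of the clamping described above, assuming both times
are expressed in the same unit. The function name and the plain
int64_t representation are hypothetical and stand in for the kernel's
own types; the point is only that a deadline before boot becomes a
zero delay instead of an error.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an absolute deadline into a delay relative to boot time.
 * A deadline that lies before boot clamps to 0, meaning "fire
 * immediately", rather than being rejected as negative.
 */
static int64_t
model_abstime_to_uptime(int64_t abs_deadline, int64_t boottime)
{
	int64_t rel;

	/* Non-negative inputs keep the subtraction free of overflow. */
	assert(abs_deadline >= 0 && boottime >= 0);
	rel = abs_deadline - boottime;
	return (rel < 0 ? 0 : rel);
}
```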


# 0321a799 23-Sep-2021 Nathaniel Wesley Filardo <nfilardo@microsoft.com>

kqueue: Add EV_KEEPUDATA flag

When this flag is set, operations that update an existing kevent will
not change the udata field. This can be used to NOTE_TRIGGER or
EV_{EN,DIS}ABLE events without overwriting the stashed pointer.

Reviewed by: Domagoj Stolfa <domagoj.stolfa@gmail.com>
Obtained from: CheriBSD
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D30286
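
The effect of the flag on an update can be modelled in a few lines.
This is a sketch under assumptions: the structs, the function and the
numeric value of MODEL_EV_KEEPUDATA are hypothetical and are not taken
from sys/event.h. It only shows the rule that an update leaves the
stored udata pointer alone when the flag is present.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative value only; not copied from sys/event.h. */
#define MODEL_EV_KEEPUDATA 0x0200

struct model_knote {
	unsigned flags;
	void *udata;		/* pointer stashed by the application */
};

struct model_change {
	unsigned flags;
	void *udata;
};

/* Apply an update to an already registered event. */
static void
model_update(struct model_knote *kn, const struct model_change *chg)
{
	assert(kn != NULL && chg != NULL);
	/* Without the flag the update overwrites the stashed pointer. */
	if ((chg->flags & MODEL_EV_KEEPUDATA) == 0)
		kn->udata = chg->udata;
}
```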


# 98168a6e 06-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

kqueue: drain kqueue taskqueue if syscall tickled it

Otherwise, after the syscall returns, the next syscall, which could be
kevent(2) on the kqueue that should be notified, races with the kqueue
taskqueue thread and potentially misses the wakeup. This is reliably
visible when kevent(2) only peeks into events using a zeroed timeout.

PR: 258310
Reported by: arichardson, Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by: arichardson, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31858


# c511383d 01-Sep-2021 Mark Johnston <markj@FreeBSD.org>

kevent: Fix races between timer detach and kqtimer_proc_continue()

- When detaching a knote, we need to double check the enqueued flag
after acquiring the process lock, as kqtimer_proc_continue() may have
toggled it.
- kqtimer_proc_continue() could in principle reschedule a stopped
callout after filt_timerdetach() drains the callout. So, we need to
re-check.

Reported by: syzbot+4a4cebb3ec07892cb040@syzkaller.appspotmail.com
Reported by: syzbot+a9c04bc76078a3b7dd8d@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31772


# c9f8dcda 02-Jun-2021 Mateusz Guzik <mjg@FreeBSD.org>

kqueue: replace kq_ncallouts loop with atomic_fetchadd


# e00bae5c 27-May-2021 Mark Johnston <markj@FreeBSD.org>

kevent: Prohibit negative change and event list lengths

Previously, a negative change list length would be treated the same as
an empty change list. A negative event list length would result in
bogus copyouts. Make kevent(2) return EINVAL for both cases so that
application bugs are more easily found, and to be more robust against
future changes to kevent internals.

Reviewed by: imp, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30480


# 2cca77ee 14-May-2021 Mark Johnston <markj@FreeBSD.org>

kqueue timer: Remove detached knotes from the process stop queue

There are some scenarios where a timer event may be detached when it is
on the process' kqueue timer stop queue. If kqtimer_proc_continue() is
called after that point, it will iterate over the queue and access freed
timer structures.

It is also possible, at least in a multithreaded program, for a stopped
timer event to be scheduled without removing it from the process' stop
queue. Ensure that we do not doubly enqueue the event structure in this
case.

Reported by: syzbot+cea0931bb4e34cd728bd@syzkaller.appspotmail.com
Reported by: syzbot+9e1a2f3734652015998c@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30251


# 7cb40543 28-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

filt_timerexpire: do not iterate over the interval

User-supplied data might make this loop too time-consuming. Divide
directly, and handle both the possibility that we were woken up earlier,
and arithmetic overflows/underflows from the calculation.

Reported and tested by: pho (previous version)
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30069


# f1f98706 18-Apr-2021 Warner Losh <imp@FreeBSD.org>

Minor style cleanup

We prefer 'while (0)' to 'while(0)' according to grep and style(9)'s
space after keyword rule. Remove a few stragglers of the latter.
Many of these usages were inconsistent within the file.

MFC After: 3 days
Sponsored by: Netflix


# 75c5cf7a 13-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

filt_timerexpire: avoid process lock recursion

Found by: syzkaller
Reported and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29746


# 2fd1ffef 05-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

Stop arming kqueue timers on knote owner suspend or terminate

This way, even if the process specified very tight reschedule
intervals, it should be stoppable/killable.

Reported and reviewed by: markj
Tested by: markj, pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29106


# 533e5057 05-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

Add helper for kqueue timers callout scheduling

Reviewed by: markj
Tested by: markj, pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29106


# 6b3a9a0f 11-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

Convert remaining cap_rights_init users to cap_rights_init_one

semantic patch:

@@

expression rights, r;

@@

- cap_rights_init(&rights, r)
+ cap_rights_init_one(&rights, r)


# 4d0c33be 09-Jan-2021 Jan Kokemüller <jan.kokemueller@gmail.com>

kevent(2): Bugfix for wrong EVFILT_TIMER timeouts

When using NOTE_NSECONDS in the kevent(2) API, NS_TO_SBT should be
used instead of US_TO_SBT, otherwise the timeout results are
misleading.

PR: 252539
Reviewed by: kevans, kib
Approved by: kevans
MFC after: 3 weeks
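
To see why the unit matters, here is a small model of converting to a
32.32 fixed-point time value. It is a sketch, not the kernel macros:
model_sbt_t and the model_* functions are hypothetical stand-ins. For a
NOTE_NSECONDS timer the value is in nanoseconds, so treating it as
microseconds inflates the timeout by a factor of 1000.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t model_sbt_t;		/* 32.32 fixed-point seconds */
#define MODEL_SBT_1S ((model_sbt_t)1 << 32)

/*
 * Convert "value", counted in units of 1/units_per_sec seconds, to
 * fixed point.  Whole seconds and the fraction are handled separately
 * so that the multiplication cannot overflow 64 bits.
 */
static model_sbt_t
model_scale_to_sbt(int64_t value, int64_t units_per_sec)
{
	int64_t secs, frac;

	assert(value >= 0);
	assert(units_per_sec > 0 && units_per_sec <= 1000000000);
	secs = value / units_per_sec;
	frac = value % units_per_sec;
	assert(secs < ((int64_t)1 << 31));
	return (secs * MODEL_SBT_1S + (frac * MODEL_SBT_1S) / units_per_sec);
}

static model_sbt_t
model_ns_to_sbt(int64_t ns)
{
	return (model_scale_to_sbt(ns, 1000000000));
}

static model_sbt_t
model_us_to_sbt(int64_t us)
{
	return (model_scale_to_sbt(us, 1000000));
}
```

A request of 500000000 ns is half a second through model_ns_to_sbt(),
but 500 seconds if the same number goes through model_us_to_sbt().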


# e90afaa0 08-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

kqueue: save space by using only one func pointer for assertions


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# 59dafcde 20-Apr-2020 Kyle Evans <kevans@FreeBSD.org>

kqueue: fix conversion of timer data to sbintime

This unbreaks the i386 kqueue timer tests after a recent change switched
NOTE_ABSTIME over to using microseconds. Notably, the data argument (which
holds useconds) is an int64_t, but we were passing it to timer2sbintime
which takes an intptr_t. Perhaps in a previous incarnation, intptr_t would
have made sense, but now it just leads to the timestamp getting truncated
and subsequently rejected when it no longer fits in an intptr_t.

PR: 245768
Reported by: lwhsu / CI
MFC after: 1 week


# 445faddf 14-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

kqueue: use new capsicum helpers


# 91898857 29-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Avoid relying on header pollution from sys/refcount.h.

MFC after: 3 days
Sponsored by: The FreeBSD Foundation


# e52327e3 07-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

proc: postpone proc unlock until after reporting with kqueue

kqueue would always relock immediately afterwards.

While here drop the NULL check for list itself. The list is
always allocated.

Sponsored by: The FreeBSD Foundation


# 792843c3 24-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Pass malloc flags directly through kevent(2) subroutines.

Some kevent functions have a boolean "waitok" parameter for use when
calling malloc(9). Replace them with the corresponding malloc() flags:
the desired behaviour is known at compile-time, so this eliminates a
couple of conditional branches, and makes the code easier to read.

No functional change intended.

Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18318


# 36c4960e 24-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Plug some kernel memory disclosures via kevent(2).

The kernel may register for events on behalf of a userspace process,
in which case it must be careful to zero the kevent struct that will be
copied out to userspace.

Reviewed by: kib
MFC after: 3 days
Security: kernel stack memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18317
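
The pattern behind this fix is to clear the whole structure before
filling in the fields that matter, so that nothing left over from
earlier use is copied out. A minimal sketch, assuming a hypothetical
struct model_kevent that only resembles struct kevent:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout; it only resembles struct kevent. */
struct model_kevent {
	uintptr_t ident;
	short filter;
	unsigned short flags;
	unsigned int fflags;
	int64_t data;
	void *udata;
};

/*
 * Prepare an event for copying out.  The memset() ensures that every
 * field the caller does not set explicitly holds zero instead of
 * stale memory.
 */
static void
model_fill_event(struct model_kevent *kev, uintptr_t ident, short filter)
{
	assert(kev != NULL);
	memset(kev, 0, sizeof(*kev));
	kev->ident = ident;
	kev->filter = filter;
}
```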


# a2afae52 24-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Ensure that knotes do not get registered when KQ_CLOSING is set.

KQ_CLOSING is set before draining the knotes associated with a kqueue,
so we must ensure that new knotes are not added after that point. In
particular, some kernel facilities may register for events on behalf
of a userspace process and race with a close of the kqueue.

PR: 228858
Reviewed by: kib
Tested by: pho
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18316


# 1eeab857 24-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Lock the knlist before releasing the in-flux state in knote_fork().

Otherwise there is a window, before iteration is resumed, during which
the knote may be freed. The in-flux state ensures that the knote will
not be removed from the knlist while locks are dropped.

PR: 228858
Reviewed by: kib
Tested by: pho
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18316


# 96fdfb36 23-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Honour the waitok parameter in kevent_expand().

Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18316


# d5e494fe 21-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Avoid unsynchronized updates to kn_status.

kn_status is protected by the kqueue's lock, but we were updating it
without the kqueue lock held. For EVFILT_TIMER knotes, there is no
knlist lock, so the knote activation could occur during the kn_status
update and result in KN_QUEUED being lost, in which case we'd enqueue
an already-enqueued knote, corrupting the queue.

Fix the problem by setting or clearing KN_DISABLED before dropping the
kqueue lock to call into the filter. KN_DISABLED is used only by the
core kevent code, so there is no side effect from setting it earlier.

Reported and tested by: Sylvain GALLIANO <sg@efficientip.com>
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18060


# 45aecd04 21-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Remove KN_HASKQLOCK.

It is a write-only flag whose last use was removed in r302235.

No functional change intended.

Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18059


# 95c05062 27-Jul-2018 David Bright <dab@FreeBSD.org>

Allow an EVFILT_TIMER kevent to be updated.

If a timer is updated (re-added) with a different time period
(specified in the .data field of the kevent), the new time period has
no effect; the timer will not expire until the original time has
elapsed. This violates the documented behavior as the kqueue(2) man
page says (in part) "Re-adding an existing event will modify the
parameters of the original event, and not result in a duplicate
entry."

This modification, adapted from a patch submitted by cem@ to PR214987,
fixes the kqueue system to allow updating a timer entry. The
kevent timer behavior is changed to:

* When a timer is re-added, update the timer parameters and
re-start the timer using the new parameters.
* Allow updating both active and already expired timers.
* When the timer has already expired, dequeue any undelivered events
and clear the count of expirations.

All of these changes address the original PR and also bring the
FreeBSD and macOS kevent timer behaviors into agreement.

A few other changes were made along the way:

* Update the kqueue(2) man page to reflect the new timer behavior.
* Fix man page style issues in kqueue(2) diagnosed by igor.
* Update the timer libkqueue system test to test for the updated
timer behavior.
* Fix the (test) libkqueue common.h file so that it includes
config.h which defines various HAVE_* feature defines, before the
#if tests for such variables in common.h. This enables the use of
the actual err(3) family of functions.
* Fix the usages of the err(3) functions in the tests for incorrect
type of variables. Those were formerly undiagnosed due to the
disablement of the err(3) functions (see previous bullet point).

PR: 214987
Reported by: Brian Wellington <bwelling@xbill.org>
Reviewed by: kib
MFC after: 1 week
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D15778


# 1c0336c1 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

kevent: annotate unused stack local


# ec8d2335 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

filt_timerdetach: only assign to old if we're going to check it in
a KASSERT


# cbd92ce6 09-May-2018 Matt Macy <mmacy@FreeBSD.org>

Eliminate the overhead of gratuitous repeated reinitialization of cap_rights

- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.

Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# 8a36da99 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: adoption of SPDX licensing ID tags.

Mainly focus on files that use the BSD 2-Clause license; however, the
tool I was using misidentified many licenses, so this was mostly a
manual, error-prone task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.


# ffb66079 24-Nov-2017 John Baldwin <jhb@FreeBSD.org>

Decode kevent structures logged via ktrace(2) in kdump.

- Add a new KTR_STRUCT_ARRAY ktrace record type which dumps an array of
structures.

The structure name in the record payload is preceded by a size_t
containing the size of the individual structures. Use this to
replace the previous code that dumped the kevent arrays for
kevent(). kdump is now able to decode the kevent structures rather
than dumping their contents via a hexdump.

One change from before is that the 'changes' and 'events' arrays are
not marked with separate 'read' and 'write' annotations in kdump
output. Instead, the first array is the 'changes' array, and the
second array (only present if kevent doesn't fail with an error) is
the 'events' array. For kevent(), empty arrays are denoted by an
entry with an array containing zero entries rather than no record.

- Move kevent decoding tables from truss to libsysdecode.

This adds three new functions to decode members of struct kevent:
sysdecode_kevent_filter, sysdecode_kevent_flags, and
sysdecode_kevent_fflags.

kdump uses these helper functions to pretty-print kevent fields.

- Move structure definitions for freebsd11 and freebsd32 kevent
structures to <sys/event.h> so that they can be shared with userland.
The 32-bit structures are only exposed if _WANT_KEVENT32 is defined.
The freebsd11 structures are only exposed if _WANT_FREEBSD11_KEVENT is
defined. The 32-bit freebsd11 structure requires both.

- Decode freebsd11 kevent structures in truss for the compat11.kevent()
system call.

- Log 32-bit kevent structures via ktrace for 32-bit compat kevent()
system calls.

- While here, constify the 'void *data' argument to ktrstruct().

Reviewed by: kib (earlier version)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12470


# 6e1619da 11-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

Add pfind_any

It looks for both regular and zombie processes. This avoids allproc relocking
previously seen with pfind -> zpfind calls.


# d1372788 29-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not cast struct kevent_args or struct freebsd11_kevent_args to
struct g_kevent_args.

On some architectures, e.g. PowerPC, there is additional padding in uap.

Reported and tested by: andreast
Sponsored by: The FreeBSD Foundation


# 2b34e843 16-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Add abstime kqueue(2) timers and expand struct kevent members.

This change implements NOTE_ABSTIME flag for EVFILT_TIMER, which
specifies that the data field contains absolute time to fire the
event.

To make this useful, data member of the struct kevent must be extended
to 64bit. Using the opportunity, I also added ext members. This
changes struct kevent almost to Apple struct kevent64, except I did
not change the type of ident and udata; the latter would cause serious
API incompatibilities.

The type of ident was kept uintptr_t since EVFILT_AIO returns a
pointer in this field, and e.g. CHERI is sensitive to the type
(discussed with brooks, jhb).

Unlike Apple kevent64, symbol versioning allows us to claim ABI
compatibility and still name the new syscall kevent(2). Compat shims
are provided for both host native and compat32.

Requested by: bapt
Reviewed by: bapt, brooks, ngie (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D11025


# f2eb97b2 16-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Style.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D11025


# 01feb4c3 14-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Use designated initializers for kevent_copyops.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 67d0b0ea 14-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Hide kev_iovlen() definition under #ifdef KTRACE, fixing build of
kernel configs without KTRACE.

Reported by: rpokala
Sponsored by: The FreeBSD Foundation
MFC after: 4 days


# 1e4296c9 12-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Ktracing kevent(2) calls with unusual arguments might lead to
overly large allocation requests.

When ktrace-ing io, sys_kevent() allocates memory to copy the
requested changes and reported events. Allocations are sized by the
incoming syscall length arguments, which are user-controlled, and
might cause overflow in calculations or too large allocations.

Since io trace chunks are limited by ktr_geniosize, there is no sense
in even trying to satisfy unbounded allocations. Export ktr_geniosize
and clamp the buffer sizes in advance.

PR: 217435
Reported by: Tim Newsham <tim.newsham@nccgroup.trust>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
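
Clamping a user-controlled element count against a byte limit without
overflowing can be sketched as follows. The function name and its
parameters are hypothetical; "limit" stands in for the role that
ktr_geniosize plays in the message above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the number of bytes to allocate for "nitems" structures of
 * "itemsize" bytes each, never exceeding "limit".  The comparison is
 * done by division so that the multiplication cannot overflow.
 */
static size_t
model_clamped_len(int nitems, size_t itemsize, size_t limit)
{
	assert(itemsize > 0);
	if (nitems <= 0)
		return (0);
	if ((size_t)nitems > limit / itemsize)
		return (limit);
	return ((size_t)nitems * itemsize);
}
```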


# 7d03ff1f 16-Jan-2017 Hiren Panchasara <hiren@FreeBSD.org>

Add kevent EVFILT_EMPTY for notification when a client has received all data
i.e. everything outstanding has been acked.

Reviewed by: bz, gnn (previous version)
MFC after: 3 days
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D9150


# b5442eba 01-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Factor out instances of a knote detach followed by a knote_drop() call.

Reviewed by: kib (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9015


# fd30dd7c 26-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Make knote KN_INFLUX state counted. This is final fix for the issue
closed by r310302 for knote().

If KN_INFLUX | KN_SCAN flags are set for the note passed to knote() or
knote_fork(), i.e. the knote is scanned, we might erroneously clear
INFLUX when finishing notification. For normal knote() it was fixed
in r310302 simply by remembering the fact that we do not own
KN_INFLUX, since there we own knlist lock and scan thread cannot clear
KN_INFLUX until we drop the lock. For knote_fork(), the situation is
more complicated: we must drop the knlist lock, AKA the process lock,
since we need to register new knotes.

Change KN_INFLUX into counter and allow shared ownership of the
in-flux state between scan and knote_fork() or knote(). Both in-flux
setters need to ensure that knote is not dropped in parallel. Added
assert about kn_influx == 1 in knote_drop() verifies that in-flux state
is not shared when knote is destroyed.

Since KBI of the struct knote is changed by addition of the int
kn_influx field, reorder kn_hook and kn_hookid to fill pad on LP64
arches [1]. This keeps sizeof(struct knote) at the same 128 bytes as it
was before addition of kn_influx, on amd64.

Reviewed by: markj
Suggested by: markj [1]
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D8898
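
The move from a flag to a counter can be illustrated with a tiny
model. This is a sketch with hypothetical names; it ignores locking
entirely and only shows why a count lets two parties share the in-flux
state where a single bit could not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the in-flux part of struct knote. */
struct flux_knote {
	int influx;		/* number of current in-flux owners */
};

/* Mark the knote in-flux on behalf of one more owner. */
static void
flux_enter(struct flux_knote *kn)
{
	assert(kn->influx >= 0);
	kn->influx++;
}

/* Release one owner; returns true when nobody holds the state. */
static bool
flux_leave(struct flux_knote *kn)
{
	assert(kn->influx > 0);
	kn->influx--;
	return (kn->influx == 0);
}

/* A knote may only be destroyed by its sole in-flux owner. */
static bool
flux_can_drop(const struct flux_knote *kn)
{
	return (kn->influx == 1);
}
```

With a single bit, the second owner leaving would clear the state for
the first one as well; with a counter each owner undoes only its own
increment.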


# 5c36b2e8 26-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Change knlist_destroy() to assert that knlist is empty instead of
accepting the wrong state and printing a warning. Do not obliterate
kl_lock and kl_unlock pointers; they are often useful for post-mortem
analysis.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D8898


# 34311568 26-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Style.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D8898


# fc05543f 25-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Some optimizations for kqueue timers.

There is no need to do two allocations per kqueue timer. Gather all
data needed by the timer callout into the structure and allocate it at
once.

Use the structure to preserve the result of timer2sbintime(), to not
perform repeated 64bit calculations in callout.

Remove tautological casts.
Remove now unused p_nexttime [1].

Noted by: markj [1]
Reviewed by: markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-MFC note: do not remove p_nexttime
Differential revision: https://reviews.freebsd.org/D8901


# 7611b728 25-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Some style.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D8901


# 4afd808b 19-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Do not clear KN_INFLUX when not owning influx state.

For notes in KN_INFLUX|KN_SCAN state, the influx bit is set by a
parallel scan. When knote() reports event for the vnode filters,
which require kqueue unlocked, it unconditionally sets and then clears
influx to keep note around kqueue unlock. There, do not clear influx
flag if a scan set it, since we do not own it, instead we prevent scan
from executing by holding knlist lock.

The knote_fork() function has somewhat similar problem, it might set
KN_INFLUX for scanned note, drop kqueue and list locks, and then clear
the flag after relock. A solution there would be different enough, as
well as the test program, so close the reported issue first.

Reported and test case provided by: yjh0502@gmail.com
PR: 214923
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 69baec36 16-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Switch from stdatomic.h to atomic.h for kernel.

Apparently stdatomic.h implementation for gcc 4.2 on sparc64 does not
work properly. This effectively reverts r251803.

Reported and tested by: lidl
Discussed with: ed
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 86f11463 16-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

Another issue reported on http://seclists.org/oss-sec/2016/q3/68 is
that struct kevent member ident has uintptr_t type, which is silently
truncated to int in the call to fget(). Explicitly check for the
valid range.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# e18ee495 01-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

When a process knote is attached to a process which is already exiting,
the knote is activated immediately. If exit1() later activates
knotes, a second activation of such a knote is attempted. Detect
the condition by the zeroed kn_ptr.p_proc pointer, and avoid the
redundant activation.

Before r302235, such knotes were removed from the knlist immediately
upon activation.

Reported by: truckman
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)


# 9eb3f143 27-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix userspace build after r302235: do not expose bool field of the
structure, change it to int.

The real fix is to sanitize user-visible definitions in sys/event.h,
e.g. the affected struct knlist is of no use for userspace programs.

Reported and tested by: jkim
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)


# 9e590ff0 27-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

When filt_proc() removes event from the knlist due to the process
exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen:
- knote kn_knlist pointer is reset
- INFLUX knote is removed from the process knlist.
And, there are two consequences:
- KN_LIST_UNLOCK() on such knote is nop
- there is nothing which would block exit1() from processing past the
knlist_destroy() (and knlist_destroy() resets knlist lock pointers).
Both consequences result either in leaked process lock, or
dereferencing NULL function pointers for locking.

Handle this by stopping embedding the process knlist into struct proc.
Instead, the knlist is allocated together with struct proc, but marked
as autodestroy on the zombie reap, by knlist_detach() function. The
knlist is freed when last kevent is removed from the list, in
particular, at the zombie reap time if the list is empty. As result,
the knlist_remove_inevent() is no longer needed and removed.

Other changes:

In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events
from kn_sfflags for knote registered by kernel to only get NOTE_CHILD
notifications. The flags leak resulted in excessive
NOTE_EXEC/NOTE_FORK reports.

Fix immediate note activation in filt_procattach(). Condition should
be either the immediate NOTE_CHILD activation, or immediate NOTE_EXIT
report for the exiting process.

In knote_fork(), do not perform racy check for KN_INFLUX before kq
lock is taken. Besides being racy, it did not account for notes
just added by scan (KN_SCAN).

Some minor and incomplete style fixes.

Analyzed and tested by: Eric Badger <eric@badgerio.us>
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D6859


# c5e44d6c 24-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Silence false LOR report due to the taskqueue mutex and kqueue lock
named the same.

Reported by: Doug Luce <doug@freebsd.con.com>
Sponsored by: The FreeBSD Foundation


# 5405e7e2 12-Mar-2016 Justin T. Gibbs <gibbs@FreeBSD.org>

Provide high precision conversion from ns,us,ms -> sbintime in kevent

In timer2sbintime(), calculate the second and fractional second portions of
the sbintime separately. When calculating the fractional second portion,
use a 64bit multiply to prevent excess truncation. This avoids the ~7% error
in the original conversion for ns, and smaller errors of the same type for us
and ms.

PR: 198139
Reviewed by: jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D5397
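
A rough sketch of the split conversion for the nanosecond case, assuming a 32.32 fixed-point layout like that of sbintime_t. The type and function names are hypothetical, and the shift-and-divide used here is only one way to keep the fractional part precise; it is not claimed to be the committed arithmetic.

```c
#include <stdint.h>

typedef int64_t sbt_sketch_t;	/* 32.32 fixed point, assumed layout */

#define	SK_NS_PER_SEC	1000000000LL

/*
 * Hypothetical conversion: handle whole seconds and the sub-second
 * remainder separately.  The remainder is below 10^9, so shifting it
 * left by 32 bits still fits in 64 bits and no precision is thrown
 * away before the division.  Assumes ns >= 0 and fewer than 2^31
 * whole seconds.
 */
sbt_sketch_t
ns_to_sbt_sketch(int64_t ns)
{
	uint64_t secs = (uint64_t)(ns / SK_NS_PER_SEC);
	uint64_t rem = (uint64_t)(ns % SK_NS_PER_SEC);
	uint64_t frac = (rem << 32) / (uint64_t)SK_NS_PER_SEC;

	return ((sbt_sketch_t)((secs << 32) + frac));
}
```

With this split, 1.5 seconds maps exactly to 2^32 + 2^31, whereas a single coarse multiplier applied to the whole value accumulates a relative error.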


# 88c2beac 18-Feb-2016 Mark Johnston <markj@FreeBSD.org>

Ensure that we test the event condition when a disabled kevent is enabled.

r274560 modified kqueue_register() to only test the event condition if the
corresponding knote is not disabled. However, this check takes place before
the EV_ENABLE flag is used to clear the KN_DISABLED flag on the knote, so
enabling a previously-disabled kevent would not result in a notification for
a triggered event. This change fixes the problem by testing for EV_ENABLE
before possibly checking the event condition.

This change also updates a kqueue regression test to exercise this case.

PR: 206368
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5307
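
The ordering issue can be illustrated with a toy model. Everything below (struct knote_sketch, SK_ENABLE, register_sketch()) is a hypothetical stand-in and not the real kqueue_register() logic; it only shows why the enable step has to come before the condition test.

```c
#include <stdbool.h>

#define	SK_ENABLE	0x1	/* hypothetical stand-in for EV_ENABLE */

struct knote_sketch {
	bool	disabled;	/* stand-in for the KN_DISABLED state */
	bool	triggered;	/* would the filter report an event now? */
	bool	queued;		/* has a notification been queued? */
};

void
register_sketch(struct knote_sketch *kn, unsigned flags)
{
	/* Clear the disabled state first ... */
	if ((flags & SK_ENABLE) != 0)
		kn->disabled = false;
	/* ... so that the condition test sees the now-enabled knote. */
	if (!kn->disabled && kn->triggered)
		kn->queued = true;
}
```

If the two steps were swapped, a knote that was disabled while its condition became true would be enabled without ever being queued.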


# fe169828 18-Feb-2016 Mark Johnston <markj@FreeBSD.org>

Return an error if both EV_ENABLE and EV_DISABLE are specified for a kevent.

Currently, this combination results in EV_DISABLE being ignored.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5307
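
A minimal sketch of such a flag check. The SK_ constants and the function name are hypothetical, and returning EINVAL is an assumption; the message above only says that an error is returned.

```c
#include <errno.h>

#define	SK_EV_ENABLE	0x0004	/* hypothetical stand-in for EV_ENABLE */
#define	SK_EV_DISABLE	0x0008	/* hypothetical stand-in for EV_DISABLE */

/*
 * Reject a request that asks for a kevent to be enabled and disabled
 * at the same time.  EINVAL is an assumed error code.
 */
int
check_enable_disable_sketch(unsigned flags)
{
	if ((flags & SK_EV_ENABLE) != 0 && (flags & SK_EV_DISABLE) != 0)
		return (EINVAL);
	return (0);
}
```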


# 0e3d6ed4 28-Jan-2016 Eric van Gyzen <vangyzen@FreeBSD.org>

kqueue EVFILT_PROC: avoid collision between NOTE_CHILD and NOTE_EXIT

NOTE_CHILD and NOTE_EXIT return something in kevent.data: the parent
pid (ppid) for NOTE_CHILD and the exit status for NOTE_EXIT.
Do not let the two events be combined, since one would overwrite
the other's data.

PR: 180385
Submitted by: David A. Bright <david_a_bright@dell.com>
Reviewed by: jhb
MFC after: 1 month
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D4900


# 3c44a349 22-Sep-2015 Mateusz Guzik <mjg@FreeBSD.org>

kqueue: simplify kern_kqueue by not refing/unrefing creds too early

No functional changes.


# 6ae26d06 01-Sep-2015 Konstantin Belousov <kib@FreeBSD.org>

Exit notification for EVFILT_PROC removes knote from the knlist. In
particular, this invalidates the knote kn_link linkage, making the
SLIST_FOREACH() loop accessing undefined values (e.g. trashed by
QUEUE_MACRO_DEBUG). If the knote is freed by other thread when kq
lock is released or when influx is cleared, e.g. by knote_scan() for
kqueue owning the knote, the iteration step would access freed memory.

Use SLIST_FOREACH_SAFE() to fix iteration.

Diagnosed by: avg
Tested by: avg, lstewart, pawel
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
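
The iteration hazard can be shown on a hand-rolled singly linked list. This sketch does not use the queue(3) macros; it spells out what a "safe" traversal does, namely saving the successor before the current element can be unlinked or freed. All names are hypothetical.

```c
#include <stdlib.h>

struct node_sketch {
	int			 value;
	struct node_sketch	*next;
};

/* Prepend a new node; aborts on allocation failure to keep the sketch short. */
struct node_sketch *
push_front_sketch(struct node_sketch *head, int value)
{
	struct node_sketch *n = malloc(sizeof(*n));

	if (n == NULL)
		abort();
	n->value = value;
	n->next = head;
	return (n);
}

/* Unlink and free every node with a negative value; return how many. */
int
remove_negative_sketch(struct node_sketch **head)
{
	struct node_sketch **prevp = head;
	struct node_sketch *n, *next;
	int removed = 0;

	for (n = *head; n != NULL; n = next) {
		next = n->next;		/* saved before n may be freed */
		if (n->value < 0) {
			*prevp = next;
			free(n);
			removed++;
		} else {
			prevp = &n->next;
		}
	}
	return (removed);
}
```

A loop that instead advanced with n = n->next after the free would read the linkage of an element that is no longer on the list, which is the class of bug described above.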


# 78b9afe1 01-Sep-2015 Konstantin Belousov <kib@FreeBSD.org>

Clean up the kqueue use of the uma KPI.

Explain why it is fine to not check for M_NOWAIT failures in
kqueue_register(). Remove unneeded check for NULL result from
waitable allocation in kqueue_scan(). uma_free(9) handles NULL
argument correctly, remove checks for NULL. Remove useless cast and
adjust style in knote_alloc().

Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 880e2c6c 12-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Perform cleanups in response to D3307.

- Document the kern_kevent_anonymous() function.
- Add assertions to ensure that we don't silently leave the kqueue
linked from a file descriptor table.

Reviewed by: jmg
Differential Revision: https://reviews.freebsd.org/D3364


# e26f6b5f 11-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Add support for anonymous kqueues.

CloudABI's polling system calls merge the concept of one-shot polling
(poll, select) and stateful polling (kqueue). They share the same data
structures.

Extend FreeBSD's kqueue to provide support for waiting for events on an
anonymous kqueue. Unlike stateful polling, there is no need to support
timeouts, as an additional timer event could be used instead.
Furthermore, it makes no sense to use a different number of input and
output kevents. Merge this into a single argument.

Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3307


# a2034cc9 05-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Allow the creation of kqueues with a restricted set of Capsicum rights.

On CloudABI we want to create file descriptors with just the minimal set
of Capsicum rights in place. The reason for this is that it makes it
easier to obtain uniform behaviour across different operating systems.

By explicitly whitelisting the operations, we can return consistent
error codes, but also prevent applications from depending on OS-specific
behaviour.

Extend kern_kqueue() to take an additional struct filecaps that is
passed on to falloc_caps(). Update the existing consumers to pass in
NULL.

Differential Revision: https://reviews.freebsd.org/D3259


# b4490c6e 18-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

The si_status field of the siginfo_t, provided by the waitid(2) and
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).

Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig. p_xexit contains complete status
and copied out into si_status.

Requested by: Joerg Schilling
Reviewed by: jilles (previous version), pho
Tested by: pho
Sponsored by: The FreeBSD Foundation


# f6f6d240 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thread's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 7236f2c2 24-May-2015 Dmitry Chagin <dchagin@FreeBSD.org>

For future use in the Linuxulator:

1. Add a kern_kqueue() counterpart for kqueue() with flags parameter.

2. Be a bit secure. To avoid a double fp lookup add a kern_kevent_fp()
counterpart for kern_kevent() with file pointer parameter instead
of file descriptor and pass the buck to it.

Suggested by: mjg [2]

Differential Revision: https://reviews.freebsd.org/D1091
Reviewed by: trasz


# fd90e2ed 22-May-2015 Jung-uk Kim <jkim@FreeBSD.org>

CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks


# 2c30bc1f 15-Nov-2014 John-Mark Gurney <jmg@FreeBSD.org>

prevent doing filter ops locking for statically compiled filter ops...
This significantly reduces lock contention when adding/removing knotes
on busy multi-kq system... Next step is to cache these references per
kq.. i.e. kq refs it once and keeps a local ref count so that the same
refs don't get accessed by many cpus...

only allocate a knote when we might use it...

Add a new flag, _FORCEONESHOT.. This allows a thread to force the
delivery of another event in a safe manner, say waking up an idle http
connection to force it to be reaped...

If we are _DISABLE'ing a knote, don't bother to call f_event on it, it's
disabled, so won't be delivered anyways..

Tested by: adrian


# 41e8f7ef 04-Oct-2014 Ian Lepore <ian@FreeBSD.org>

Make kevent(2) periodic timer events more reliably periodic. The event
callout is now scheduled using the C_ABSOLUTE flag, and the absolute time
of each event is calculated as the time the previous event was scheduled
for plus the interval. This ensures that latency in processing a given
event doesn't perturb the arrival time of any subsequent events.

Reviewed by: jhb
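
The difference between the two scheduling strategies can be sketched with a small simulation. The function below is purely illustrative and uses abstract time units; it is not the callout code.

```c
#include <stdint.h>

/*
 * Hypothetical simulation: return the deadline of the n-th event of a
 * periodic timer when every event is processed `latency` units late.
 * With absolute != 0, each deadline is derived from the previous
 * deadline; otherwise it is derived from the time the previous event
 * was actually processed.
 */
int64_t
nth_deadline_sketch(int n, int64_t start, int64_t interval, int64_t latency,
    int absolute)
{
	int64_t deadline = start + interval;

	for (int i = 1; i < n; i++) {
		int64_t fired_at = deadline + latency;

		deadline = (absolute ? deadline : fired_at) + interval;
	}
	return (deadline);
}
```

With an interval of 100 and a latency of 7, the fifth deadline stays at 500 in the absolute scheme but slips to 528 in the relative one, since each processing delay is carried into all later events.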


# 9696feeb 22-Sep-2014 John Baldwin <jhb@FreeBSD.org>

Add a new fo_fill_kinfo fileops method to add type-specific information to
struct kinfo_file.
- Move the various fill_*_info() methods out of kern_descrip.c and into the
various file type implementations.
- Rework the support for kinfo_ofile to generate a suitable kinfo_file object
for each file and then convert that to a kinfo_ofile structure rather than
keeping a second, different set of code that directly manipulates
type-specific file information.
- Remove the shm_path() and ksem_info() layering violations.

Differential Revision: https://reviews.freebsd.org/D775
Reviewed by: kib, glebius (earlier version)


# 2d69d0dc 12-Sep-2014 John Baldwin <jhb@FreeBSD.org>

Fix various issues with invalid file operations:
- Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(),
and invfo_kqfilter() for use by file types that do not support the
respective operations. Home-grown versions of invfo_poll() were
universally broken (they returned an errno value, invfo_poll()
uses poll_no_poll() to return an appropriate event mask). Home-grown
ioctl routines also tended to return an incorrect errno (invfo_ioctl
returns ENOTTY).
- Use the invfo_*() functions instead of local versions for
unsupported file operations.
- Reorder fileops members to match the order in the structure definition
to make it easier to spot missing members.
- Add several missing methods to linuxfileops used by the OFED shim
layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat(). Most
of these used invfo_*(), but a dummy fo_stat() implementation was
added.


# 42e62eca 18-Jul-2014 Baptiste Daroussin <bapt@FreeBSD.org>

Extend kqueue's EVFILT_TIMER by adding precision unit flags support

Define the precision macros as bit sets to conform with XNU equivalent.
Test fflags passed for EVFILT_TIMER and return EINVAL in case an invalid flag
is passed.

Phabric: https://phabric.freebsd.org/D421
Reviewed by: kib


# 4bc38a5a 12-Apr-2014 Davide Italiano <davide@FreeBSD.org>

Hide internal details of sbintime_t implementation wrapping INT64_MAX into
SBT_MAX, to make it more robust in case internal type representation will
change in the future. All the consumers were migrated to SBT_MAX and
every new consumer (if any) should from now use this interface.

Requested by: bapt, jmg, Ryan Lortie (implicitly)
Reviewed by: mav, bde


# 38219d6a 07-Apr-2014 Ed Schouten <ed@FreeBSD.org>

Implement kqueue(2) for procdesc(4).

kqueue(2) already supports EVFILT_PROC. Add an EVFILT_PROCDESC that
behaves the same, but operates on a procdesc(4) instead. Only implement
NOTE_EXIT for now. The nice thing about NOTE_EXIT is that it also
returns the exit status of the process, meaning that we can now obtain
this value, even if pdwait4(2) is still unimplemented.

Notes:

- Simply reuse EVFILT_NETDEV for EVFILT_PROCDESC. As both of these will
be used on totally different descriptor types, this should not clash.

- Let procdesc_kqops_event() reuse the same structure as filt_proc().
The only difference is that procdesc_kqops_event() should also be able
to deal with the case where the process was already terminated after
registration. Simply test this when hint == 0.

- Fix some style(9) issues in filt_proc() to keep it consistent with the
newly added procdesc_kqops_event().

- Save the exit status of the process in pd->pd_xstat, as we cannot pick
up the proctree_lock from within procdesc_kqops_event().

Discussed on: arch@
Reviewed by: kib@


# 1a5edcf8 05-Apr-2014 Konstantin Belousov <kib@FreeBSD.org>

When KN_INFLUX is set on the knote due to kqueue_register() or
kqueue_scan() unlocking the kqueue to call f_event, knote() or
knote_fork() should not skip the knote. The knote is not going to
disappear during the influx time, and the mutual exclusion between
scan and knote() is ensured by both code paths taking knlist lock.
The race appears since knlist lock is before kq lock, so KN_INFLUX
must be set, kq lock must be dropped and only then knlist lock can be
taken. The window between kq unlock and knlist lock causes lost
events.

Add a flag KN_SCAN to indicate that KN_INFLUX is set in a manner safe
for the knote(), and check for it to ignore KN_INFLUX in the knote*()
as needed. Also, in knote(), remove the lockless check for the
KN_INFLUX flag, which could also result in the lost notification.

Reported and tested by: Kohji Okuno <okuno.kohji@jp.panasonic.com>
Discussed with: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 4a144410 16-Mar-2014 Robert Watson <rwatson@FreeBSD.org>

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# fda21f4d 16-Jan-2014 Adrian Chadd <adrian@FreeBSD.org>

Add in a default initialiser for the EVOPS_SENDFILE kqueue filterops.

Sponsored by: Netflix, Inc.


# faa9b054 06-Jan-2014 Adrian Chadd <adrian@FreeBSD.org>

Add a compile-time control over the size of KN_HASHSIZE.

This is needed for applications that use a lot of non-filedescriptor
knotes.

MFC after: 1 week
Sponsored by: Netflix, Inc.


# 774e8d90 19-Dec-2013 Stefan Eßer <se@FreeBSD.org>

Fix compilation on 32 bit architectures and use INT64_MAX instead of
LONG_MAX for the upper bound check.


# 53d5cc25 19-Dec-2013 Stefan Eßer <se@FreeBSD.org>

Fix overflow for timeout values of more than 68 years, which is the maximum
covered by sbintime (LONG_MAX seconds).

Some programs use timeout values in excess of 1000 years. The conversion
to sbintime caused wrap-around on overflow, which resulted in short or
negative timeout values. This caused long delays on sockets opened by
affected programs (e.g. OpenSSH).

Kernels compiled without -fno-strict-overflow were not affected, apparently
because the compiler tested the sign of the timeout value before performing
the multiplication that led to overflow.

When the -fno-strict-overflow option was added to CFLAGS, this optimization
was disabled and the test was performed on the result of the multiplication.
Negative products were caught and resulted in EINVAL being returned, but
wrap-around to positive values just shortened the timeout value to the
residue of the result that could be represented by sbintime.

The fix is to cap the timeout values at the maximum that can be represented
by sbintime, which is 2^31 - 1 seconds or more than 68 years.

After this change, the kernel can be compiled with -fno-strict-overflow
with no ill effects.

MFC after: 3 days
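
The capping idea can be sketched as below, assuming a 32.32 fixed-point representation in a signed 64-bit integer. The names are hypothetical and the handling of negative input (returning -1 for the caller to reject) is an assumption made to keep the sketch total.

```c
#include <stdint.h>

#define	SK_SBT_MAX	INT64_MAX	/* largest representable value */

/*
 * Hypothetical conversion from whole seconds to 32.32 fixed point.
 * Values above 2^31 - 1 seconds would wrap around when shifted, so
 * they are capped instead of converted.
 */
int64_t
secs_to_sbt_capped_sketch(int64_t secs)
{
	if (secs < 0)
		return (-1);		/* caller treats this as invalid */
	if (secs > INT32_MAX)
		return (SK_SBT_MAX);	/* cap instead of overflowing */
	return ((int64_t)((uint64_t)secs << 32));
}
```

A timeout of 1000 years is then treated as "as long as can be represented" instead of wrapping to a short or negative value.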


# ed5848c8 15-Nov-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Replace CAP_POLL_EVENT and CAP_POST_EVENT capability rights (which I had
a very hard time to fully understand) with much more intuitive rights:

CAP_EVENT - when set on descriptor, the descriptor can be monitored
with syscalls like select(2), poll(2), kevent(2).

CAP_KQUEUE_EVENT - When set on a kqueue descriptor, the kevent(2)
syscall can be called on this kqueue with the eventlist
argument set to non-NULL value; in other words the given
kqueue descriptor can be used to monitor other descriptors.
CAP_KQUEUE_CHANGE - When set on a kqueue descriptor, the kevent(2)
syscall can be called on this kqueue with the changelist
argument set to non-NULL value; in other words it allows to
modify events monitored with the given kqueue descriptor.

Add alias CAP_KQUEUE, which allows for both CAP_KQUEUE_EVENT and
CAP_KQUEUE_CHANGE.

Add backward compatibility define CAP_POLL_EVENT which is equal to CAP_EVENT.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 1947c8a6 03-Nov-2013 Jilles Tjoelker <jilles@FreeBSD.org>

kqueue: Change error for kqueues rlimit from EMFILE to ENOMEM and document
this error condition in the kqueue(2) manual page.

Discussed with: kib


# 9110db81 21-Oct-2013 Konstantin Belousov <kib@FreeBSD.org>

Add a resource limit for the total number of kqueues available to the
user. Kqueue now saves the ucred of the allocating thread, to
correctly decrement the counter on close.

Under some specific and not real-world use scenario for kqueue, it is
possible for the kqueues to consume memory proportional to the square
of the number of the filedescriptors available to the process. Limit
allows administrator to prevent the abuse.

This is kernel-mode side of the change, with the user-mode enabling
commit following.

Reported and tested by: pho
Discussed with: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 9d2abcd0 26-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Do not allow negative timeouts for kqueue timers, check for the
negative timeout both before and after the conversion to sbintime_t.

For periodic kqueue timer, convert zero timeout into 1ms, to avoid
interrupt storm on fast event timers.

Reported and tested by: pho
Discussed with: mav
Reviewed by: davide
Sponsored by: The FreeBSD Foundation
Approved by: re (marius)


# 19f6a6a1 22-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Pre-acquire the filedesc sx when a possibility exists that the later
code could need to remove a kqueue from the filedesc list. Global
lock is already locked, which causes sleepable after non-sleepable
lock acquisition.

Reported and tested by: pho
Reviewed by: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)


# b12698e1 18-Sep-2013 Roman Divacky <rdivacky@FreeBSD.org>

Revert r255672, it has some serious flaws, leaking file references etc.

Approved by: re (delphij)


# 253c75c0 18-Sep-2013 Roman Divacky <rdivacky@FreeBSD.org>

Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue
to implement epoll subset of functionality. The kqueue user data are 32bit
on i386 which is not enough for epoll user data so this patch overrides
kqueue fileops to maintain enough space in struct file.

Initial patch developed by me in 2007 and then extended and finished
by Yuri Victorovich.

Approved by: re (delphij)
Sponsored by: Google Summer of Code
Submitted by: Yuri Victorovich <yuri at rawbw dot com>
Tested by: Yuri Victorovich <yuri at rawbw dot com>


# e8de242d 13-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Use TAILQ instead of STAILQ for kqueue filedescriptors to ensure constant
time removal on kqueue close.

Reported and tested by: pho
Reviewed by: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (delphij)


# 7008be5b 04-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation
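
The index-bit encoding described in this message can be sketched as follows. The SK_ macros and helpers are hypothetical; the position of the five-bit index field (bits 57 to 61, directly below the top two bits) is inferred from the description above, and the right values are the ones quoted in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit in the five-bit index field selects the array element. */
#define	SK_CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))

#define	SK_CAP_LOOKUP	SK_CAPRIGHT(0, 0x0000000000000400ULL)
#define	SK_CAP_FCHMOD	SK_CAPRIGHT(0, 0x0000000000002000ULL)
#define	SK_CAP_PDKILL	SK_CAPRIGHT(1, 0x0000000000000800ULL)

/* Extract the five-bit index field from a right or an OR of rights. */
unsigned
sk_index_bits(uint64_t rights)
{
	return ((unsigned)((rights >> 57) & 0x1f));
}

/*
 * True when exactly one index bit is set, that is, when all rights
 * that were OR-ed together belong to the same array element.
 */
bool
sk_same_element(uint64_t rights)
{
	unsigned bits = sk_index_bits(rights);

	return (bits != 0 && (bits & (bits - 1)) == 0);
}
```

OR-ing SK_CAP_LOOKUP with SK_CAP_FCHMOD leaves a single index bit set, while OR-ing SK_CAP_LOOKUP with SK_CAP_PDKILL sets two, which is the condition the assertion mentioned above can detect.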


# 57150ff6 26-Aug-2013 John-Mark Gurney <jmg@FreeBSD.org>

fix up some comments and a white space issue...

MFC after: 3 days


# ca04d21d 15-Aug-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# e05bf4cf 13-Aug-2013 John Baldwin <jhb@FreeBSD.org>

Some small cleanups to the fixes in r180340:
- Set NOTE_TRACKERR before running filt_proc(). If the knote did not
have NOTE_FORK set in fflags when registered, then the TRACKERR event
could miss being posted.
- Don't pass the pid in to filt_proc() for NOTE_FORK events. The special
handling for pids is done knote_fork() directly and no longer in
filt_proc().

MFC after: 2 weeks


# 5b596f0f 07-Aug-2013 John Baldwin <jhb@FreeBSD.org>

Don't emit a spurious EVFILT_PROC event with no fflags set on process exit
if NOTE_EXIT is not being monitored. The rationale is that a listener
should only get an event for exit() if they registered interest via
NOTE_EXIT. This matches the behavior on OS X.
- Don't save the exit status on process exit unless NOTE_EXIT is being
monitored.
- Add an internal EV_DROP flag that requests kqueue_scan() to free the
knote without signalling it to userland and use this when a process
exits but the fflags in the knote is zero.

Reviewed by: jmg
MFC after: 1 month


# 2381f6ef 16-Jun-2013 Ed Schouten <ed@FreeBSD.org>

Change callout use counter to use C11 atomics.

In order to get some coverage of C11 atomics in kernelspace, switch at
least one piece of code in kernelspace to use C11 atomics instead of
<machine/atomic.h>.

While there, slightly improve the code by adding an assertion to prevent
the use count from going negative.


# 21a37a71 09-Mar-2013 Alexander Motin <mav@FreeBSD.org>

Rework overflow checks of r247898 so that a too "intelligent" compiler
cannot optimize them out.

Submitted by: bde


# 836972b8 07-Mar-2013 Alexander Motin <mav@FreeBSD.org>

Fix off-by-one error in nanoseconds validation.

Submitted by: bde


# 980c545d 06-Mar-2013 Alexander Motin <mav@FreeBSD.org>

Fix time math overflows and improve zero intervals handling in poll(),
select(), nanosleep() and kevent() functions after calloutng changes.

Reported by: bde


# 40e794ab 04-Mar-2013 Davide Italiano <davide@FreeBSD.org>

MFcalloutng:
- Rewrite kevent() timeout implementation to allow sub-tick precision.
- Make the interval timings for EVFILT_TIMER more accurate. This also
removes a hack introduced in r238424.

Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil


# 29196684 13-Jul-2012 John Baldwin <jhb@FreeBSD.org>

Make the interval timings for EVFILT_TIMER more accurate. tvtohz() always
adds an extra tick to account for the current partial clock tick. However,
that is not appropriate for a repeating timer when the exact tvtohz() value
should be used for subsequent intervals. Fix repeating callouts for
EVFILT_TIMER by subtracting 1 tick from the tvtohz() result similar to the
fix used in realitexpire() for interval timers.

While here, update a few comments to note that if the EVFILT_TIMER code
were to move out of kern_event.c, it should move to kern_time.c (where the
interval timer code it mimics lives) rather than kern_timeout.c.

MFC after: 1 month


# a79de683 14-Jun-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Update comment.

MFC after: 1 month


# b25711e6 26-Mar-2012 Alexander V. Chernikov <melifaro@FreeBSD.org>

- Add knlist_init_rw_reader() function to kqueue(9).
Function acquires reader lock if needed.
Assert check for reader or writer lock (RA_LOCKED / RA_UNLOCKED)
- While here, add knlist_init_mtx.9 to MLINKS and fix some style(9) issues

Reviewed by: glebius
Approved by: ae(mentor)

MFC after: 2 weeks


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 6aba400a 25-Aug-2011 Attilio Rao <attilio@FreeBSD.org>

Fix a deficiency in the selinfo interface:
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.

That happens because the selinfo interface has no way to drain the
waiters before destroying the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.

Fix this by adding the seldrain() routine which should be called
before destroying the selinfo objects (in order to avoid such case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
like it happens in device drivers which install selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the filedescriptors should be already
closed, thus there cannot be a race.
For this case, mfi(4) device driver can be set as an example, as it
implements fully correct logic for preventing this from happening.

Sponsored by: Sandvine Incorporated
Reported by: rstone
Tested by: pluknet
Reviewed by: jhb, kib
Approved by: re (bz)
MFC after: 3 weeks


# 9c00bb91 16-Aug-2011 Konstantin Belousov <kib@FreeBSD.org>

Add the fo_chown and fo_chmod methods to struct fileops and use them
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.

Based on the submission by: glebius
Reviewed by: rwatson
Approved by: re (bz)


# d1b6899e 12-Aug-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Rename CAP_*_KEVENT to CAP_*_EVENT.

Change the names of a couple of capability rights to be less
FreeBSD-specific.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc


# a9d2f8d8 10-Aug-2011 Robert Watson <rwatson@FreeBSD.org>

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


# 1fe80828 01-Apr-2011 Konstantin Belousov <kib@FreeBSD.org>

After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9)
and remove the falloc() version that lacks flag argument. This is done
to reduce the KPI bloat.

Requested by: jhb
X-MFC-note: do not


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 805de54c 14-Apr-2010 John Baldwin <jhb@FreeBSD.org>

MFC 205886:
Defer freeing a kevent list until after dropping kqueue locks.


# 7e3d78ae 30-Mar-2010 John Baldwin <jhb@FreeBSD.org>

Defer freeing a kevent list until after dropping kqueue locks.

LOR: 185
Submitted by: Matthew Fleming @ Isilon
MFC after: 1 week


# f723d876 17-Feb-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r203875:
Do not leak process lock when current thread is not allowed to see target.


# d07dc8e3 14-Feb-2010 Konstantin Belousov <kib@FreeBSD.org>

Do not leak process lock when current thread is not allowed to see target.

Bumped into by: ed
MFC after: 3 days


# 52c240aa 22-Jan-2010 Brooks Davis <brooks@FreeBSD.org>

MFC r201350:

The devices that supported EVFILT_NETDEV kqueue filters were removed in
r195175. Remove all definitions, documentation, and usage.

The change of function signature for vlan_link_state() was not merged to
maintain the ABI.


# 97a0b2e1 08-Jan-2010 Brooks Davis <brooks@FreeBSD.org>

MFC r201352

If a filter has already been added, actually return EEXIST when trying
to add it again.


# 7eb5db51 31-Dec-2009 Brooks Davis <brooks@FreeBSD.org>

If a filter has already been added, actually return EEXIST when trying
to add it again.

MFC after: 1 week


# a6fffd6c 31-Dec-2009 Brooks Davis <brooks@FreeBSD.org>

The devices that supported EVFILT_NETDEV kqueue filters were removed in
r195175. Remove all definitions, documentation, and usage.

fifo_misc.c:
Remove all kqueue tests as fifo_io.c performs all those that
would have remained.

Reviewed by: rwatson
MFC after: 3 weeks
X-MFC note: don't change vlan_link_state() function signature


# 83613795 31-Oct-2009 Stacey Son <sson@FreeBSD.org>

MFC 197240,197241,197242,197243,197293,197294,197407:

Add EVFILT_USER filter and EV_DISPATCH/EV_RECEIPT flags to kevent(2).

Approved by: rwatson (mentor)


# 59849206 29-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197930:
Postpone dropping fp till both kq_global and kqueue mutexes are
unlocked.


# 47d81c1b 10-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

Postpone dropping fp till both kq_global and kqueue mutexes are
unlocked. fdrop() closes file descriptor when reference count goes to
zero. Close method for vnodes locks the vnode, resulting in "sleepable
after non-sleepable". For pipes, pipe mutex is before kqueue lock,
causing LOR.

Reported and tested by: pho
MFC after: 2 weeks


# 13dcbd75 28-Sep-2009 Xin LI <delphij@FreeBSD.org>

Use correct sizeof() object for klist 'list'. Currently, struct klist
contains only SLIST_HEAD as its member, thus sizeof(struct klist) would
equal sizeof(struct klist *), so this change makes the code more
correct in terms of semantics, but should be a no-op to the compiler at
this time.

Reported by: MQ <antinvidia at gmail com>
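
A minimal sketch of the sizeof() point above, assuming hypothetical
demo_ types rather than the kernel's real struct klist. It sizes the
allocation from the object being allocated, and shows why the
pointer-sized computation happened to give the same number:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the kernel definitions. */
struct demo_knote;
struct demo_klist {
	struct demo_knote *slh_first;	/* a list head holding one pointer */
};

/*
 * Allocate an array of n zeroed list heads. The element size comes from
 * the object itself, so it stays right if the struct ever grows.
 */
struct demo_klist *
demo_klist_array_alloc(size_t n)
{
	struct demo_klist *list;

	assert(n > 0);
	list = calloc(n, sizeof(*list));
	return (list);
}

/* Returns 1 when the struct and a pointer to it have the same size. */
int
demo_sizes_coincide(void)
{
	return (sizeof(struct demo_klist) == sizeof(struct demo_klist *));
}
```

On common ABIs demo_sizes_coincide() reports that the two sizes match,
which is why a change of this kind would not alter generated code.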


# 1c2825bd 22-Sep-2009 Roman Divacky <rdivacky@FreeBSD.org>

Change unsigned foo to u_foo as required by style(9).

Requested by: bde
Approved by: ed (mentor)


# 6413e27b 17-Sep-2009 Roman Divacky <rdivacky@FreeBSD.org>

Fix the style of the previous commit.

Approved by: ed (mentor, implicit)


# abc8594d 17-Sep-2009 Roman Divacky <rdivacky@FreeBSD.org>

Make these arguments/variables unsigned, as the defines for them don't
fit into a signed 32-bit integer.

Approved by: ed (mentor, implicit)
Approved by: sson


# fdc1a113 15-Sep-2009 Stacey Son <sson@FreeBSD.org>

Add EV_RECEIPT to kevents.

EV_RECEIPT is useful for disambiguating error conditions when multiple
event structures are passed to kevent(2). The error code is returned
in the data field and EV_ERROR is set.

Approved by: rwatson (co-mentor)


# 1a921c41 15-Sep-2009 Stacey Son <sson@FreeBSD.org>

Add the EV_DISPATCH flag to kevents.

When the EV_DISPATCH flag is used the event source will be disabled
immediately after the delivery of an event. This is similar to the
EV_ONESHOT flag but it doesn't delete the event.

Approved by: rwatson (co-mentor)
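
The difference between EV_DISPATCH and EV_ONESHOT described above can
be modelled in a few lines of portable C. This is only a sketch: the
demo_ names and the state enum are hypothetical, the flag values are
illustrative stand-ins, and no real kqueue is involved.

```c
/* Illustrative flag values; treat them as assumptions, not as the ABI. */
#define DEMO_EV_ONESHOT		0x0010u
#define DEMO_EV_DISPATCH	0x0080u

/* Hypothetical model of what happens to an event after one delivery. */
enum demo_state {
	DEMO_ENABLED,	/* stays registered and armed */
	DEMO_DISABLED,	/* stays registered, must be re-enabled */
	DEMO_DELETED	/* registration is gone */
};

enum demo_state
demo_after_delivery(unsigned int flags)
{
	if (flags & DEMO_EV_ONESHOT)
		return (DEMO_DELETED);
	if (flags & DEMO_EV_DISPATCH)
		return (DEMO_DISABLED);
	return (DEMO_ENABLED);
}
```

In this model a dispatched event survives delivery and only needs to be
re-enabled, while a one-shot event would have to be added again.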


# 2c2e4499 15-Sep-2009 Stacey Son <sson@FreeBSD.org>

Add EVFILT_USER to kevents.

Add user events support to kernel events which are not associated with any
kernel mechanism but are triggered by user level code. This is useful for
adding user level events to an event handler that may also be monitoring
kernel events.

Approved by: rwatson (co-mentor)


# 95128e98 15-Sep-2009 Stacey Son <sson@FreeBSD.org>

Add optional touch event filter hooks to kevents.

The touch event filter is called when a kernel event's data is possibly
updated. There are two hook points. First, during a kevent() system
call. Second, when an event has been triggered.

Approved by: rwatson (co-mentor)


# e76d823b 12-Sep-2009 Robert Watson <rwatson@FreeBSD.org>

Use C99 initialization for struct filterops.

Obtained from: Mac OS X
Sponsored by: Apple Inc.
MFC after: 3 weeks
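
For readers unfamiliar with the C99 style mentioned above, a small
sketch follows. The struct and callbacks are hypothetical demo_
stand-ins whose member names only loosely mirror a filter operations
table; they are not the kernel's struct filterops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical knote and operations table, for illustration only. */
struct demo_knote {
	long	kn_data;
};

struct demo_filterops {
	int	f_isfd;
	int	(*f_attach)(struct demo_knote *);
	void	(*f_detach)(struct demo_knote *);
	int	(*f_event)(struct demo_knote *, long);
};

static void
demo_detach(struct demo_knote *kn)
{
	assert(kn != NULL);
	kn->kn_data = 0;
}

static int
demo_event(struct demo_knote *kn, long hint)
{
	assert(kn != NULL);
	kn->kn_data = hint;
	return (hint != 0);
}

/*
 * Designated initializers name each member, so the table does not
 * depend on member order. Members left out (f_attach) are NULL.
 */
const struct demo_filterops demo_filtops = {
	.f_isfd = 1,
	.f_detach = demo_detach,
	.f_event = demo_event,
};
```

With positional initialization, adding a member in the middle of the
struct would silently shift every later entry; naming the members avoids
that.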


# fe1d3f15 28-Jun-2009 Stanislav Sedov <stas@FreeBSD.org>

- Turn the third (islocked) argument of the knote call into flags parameter.
Introduce the new flag KNF_NOKQLOCK to allow event callers to be called
without KQ_LOCK mtx held.
- Modify VFS knote calls to always use KNF_NOKQLOCK flag. This is required
for ZFS as its getattr implementation may sleep.

Approved by: re (rwatson)
Reviewed by: kib
MFC after: 2 weeks


# d8b0556c 10-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland


# e11e3f18 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.


# 1ede983c 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 7054ee4e 07-Jul-2008 Konstantin Belousov <kib@FreeBSD.org>

The kqueue_register() function assumes that it is called from the top of
the syscall code and acquires various event subsystem locks as needed.
The handling of the NOTE_TRACK for EVFILT_PROC is currently done by
calling the kqueue_register() from filt_proc() filter, causing recursive
entrance of the kqueue code. This results in the LORs and recursive
acquisition of the locks.

Implement the variant of the knote() function designed to only handle
the fork() event. It mostly copies the knote() body, but also handles
the NOTE_TRACK, removing the handling from the filt_proc(), where it
causes problems described above. The function is called from the fork1()
instead of knote().

When encountering NOTE_TRACK knote, it marks the knote as influx
and drops the knlist and kqueue lock. In this context call to
kqueue_register is safe from the problems.

An error from the kqueue_register() is reported to the observer as
NOTE_TRACKERR fflag.

PR: 108201
Reviewed by: jhb, Pramod Srinivasan <pramod juniper net> (previous version)
Discussed with: jmg
Tested by: pho
MFC after: 2 weeks


# e1a32fd4 07-Jul-2008 Konstantin Belousov <kib@FreeBSD.org>

In r178914 I erroneously put the setting of the KQ_FLUXWAIT flag before
KQ_FLUX_WAKEUP(). Since the latter macro clears KQ_FLUXWAIT, the
kqueue_scan() thread may not be woken up.

Move the setting of KQ_FLUXWAIT after wakeup to correct the issue.

Reported and tested by: pho
MFC after: 3 days
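
The ordering problem described above can be shown with a
single-threaded model. Everything here is a hypothetical sketch: the
demo_ names, the flag value, and the wakeup counter are assumptions, and
there is no real locking or sleeping.

```c
#define DEMO_FLUXWAIT	0x04u	/* illustrative value only */

/* Hypothetical queue state: a flag word plus a count of wakeups sent. */
struct demo_kq {
	unsigned int	state;
	unsigned int	wakeups;
};

/* Models a wakeup helper that clears the wait flag when it fires. */
static void
demo_flux_wakeup(struct demo_kq *kq)
{
	if (kq->state & DEMO_FLUXWAIT) {
		kq->state &= ~DEMO_FLUXWAIT;
		kq->wakeups++;
	}
}

/* Flag set first: the helper clears it again. Returns 1 if still set. */
int
demo_prepare_sleep_buggy(struct demo_kq *kq)
{
	kq->state |= DEMO_FLUXWAIT;
	demo_flux_wakeup(kq);
	return ((kq->state & DEMO_FLUXWAIT) != 0);
}

/* Wakeup first, flag second: the flag survives until the sleep. */
int
demo_prepare_sleep_fixed(struct demo_kq *kq)
{
	demo_flux_wakeup(kq);
	kq->state |= DEMO_FLUXWAIT;
	return ((kq->state & DEMO_FLUXWAIT) != 0);
}
```

In the buggy order the sleeper goes to sleep with the flag already
cleared, so in this model nothing would later know to wake it.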


# e15864ef 10-May-2008 Konstantin Belousov <kib@FreeBSD.org>

kqueue_scan() may sleep when it encounters influx knotes. On the other
hand, it may cause other threads to sleep, since kqueue_scan() may mark
some knotes as influx. This could lead to a deadlock.

Before kqueue_scan() sleeps, wakeup the threads that are waiting for the
influx knotes produced by this thread.

Tested by: pho (previous version)
Reviewed by: jmg
MFC after: 2 weeks


# 2e711e4d 10-May-2008 Konstantin Belousov <kib@FreeBSD.org>

kqueue_close() encountering KN_INFLUX knotes on the kq being closed is
a legitimate situation. For instance, a file descriptor with
registered events may be closed in parallel with closing the kqueue.
Properly handle the case instead of asserting that this cannot happen.

Reported and tested by: pho
Reviewed by: jmg
MFC after: 2 weeks


# e8245292 02-Apr-2008 Jeff Roberson <jeff@FreeBSD.org>

- Convert two timeout users to the new callout_reset_curcpu() api.

Sponsored by: Nokia


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# e4650294 07-Jan-2008 John Baldwin <jhb@FreeBSD.org>

Make ftruncate a 'struct file' operation rather than a vnode operation.
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.

Submitted by: rwatson


# 397c19d1 29-Dec-2007 Jeff Roberson <jeff@FreeBSD.org>

Remove explicit locking of struct file.
- Introduce a finit() which is used to initialize the fields of struct file
in such a way that the ops vector is only valid after the data, type,
and flags are valid.
- Protect f_flag and f_count with atomic operations.
- Remove the global list of all files and associated accounting.
- Rewrite the unp garbage collection such that it no longer requires
the global list of all files and instead uses a list of all unp sockets.
- Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by: kris, pho


# ace8398d 15-Dec-2007 Jeff Roberson <jeff@FreeBSD.org>

Refactor select to reduce contention and hide internal implementation
details from consumers.

- Track individual selecters on a per-descriptor basis such that there
are no longer collisions and after sleeping for events only those
descriptors which triggered events must be rescanned.
- Protect the selinfo (per descriptor) structure with a mtx pool mutex.
mtx pool mutexes were chosen to preserve api compatibility with
existing code which does nothing but bzero() to setup selinfo
structures.
- Use a per-thread wait channel rather than a global wait channel.
- Hide select implementation details in a seltd structure which is
opaque to the rest of the kernel.
- Provide a 'selsocket' interface for those kernel consumers who wish to
select on a socket when they have no fd so they no longer have to
be aware of select implementation details.

Tested by: kris
Reviewed on: arch


# d7f81adb 14-Jul-2007 Craig Rodrigues <rodrigc@FreeBSD.org>

Revert previous commits which I committed by mistake.

Approved by: re (implicit)
Pointy hat to: me


# d678780e 14-Jul-2007 Craig Rodrigues <rodrigc@FreeBSD.org>

The last entry in the ext2_opts array must be NULL,
otherwise the kernel will crash in vfs_filteropt() if an invalid
mount option is passed to ext2fs.

Approved by: re (kensmith)


# dede2ab3 28-May-2007 Robert Watson <rwatson@FreeBSD.org>

In kern_kevent(), unconditionally fdrop() fp once fget() has succeeded,
as we never have an opportunity to set it to NULL.

Found with: Coverity Prevent(tm)
CID: 2161


# 87066f04 27-May-2007 Robert Watson <rwatson@FreeBSD.org>

Select a more appealing spelling for the word acquire.


# 5e3f7694 04-Apr-2007 Robert Watson <rwatson@FreeBSD.org>

Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are important for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
acquisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by: kris
Discussed with: jhb, kris, attilio, jeff


# 0c14ff0e 04-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.


# 6600b45d 20-Nov-2006 John Baldwin <jhb@FreeBSD.org>

Save exit status of an exiting process in kn_data in the knote.

Submitted by: Jared Yanovich ^phirerunner at comcast.net^
MFC after: 2 weeks


# 33fabe46 24-Sep-2006 John-Mark Gurney <jmg@FreeBSD.org>

remove unnecessary NULL check...

Coverity ID: 1545


# 4db71d27 23-Sep-2006 John-Mark Gurney <jmg@FreeBSD.org>

hide kqueue_register from public view, and replace it w/ kqfd_register...
this eliminates a possible race in aio registering a kevent..


# 9edac6f3 23-Sep-2006 John-Mark Gurney <jmg@FreeBSD.org>

add KTRACE hooks into kevent... This will help people debug their kqueue
programs to find out exactly which events were registered and which were
returned... This should be lower in kern_kevent, but that would require
special munging due to locks and the functions used to copyin/copyout
kevents...

If someone wants to teach ktrace how to output pretty kevents, I have a
kevent pretty printer that can be used...


# 5c69ad83 12-Jun-2006 John Baldwin <jhb@FreeBSD.org>

Use fget() in kqueue_register() instead of doing all the work by hand.


# f420242b 02-Jun-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Don't forget to unlock kq lock in low memory situations.

OK'ed by: jmg


# 8ebab14c 02-Jun-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove confusing done_noglobal label. The KQ_GLOBAL_UNLOCK() macro knows
how to handle both situations - when kq_global lock is and is not held.

OK'ed by: jmg


# 241321ab 02-Jun-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Use SLIST_FOREACH_SAFE() macro, because knote_drop() can free an element
which can be then used to find next element in the list.

OK'ed by: jmg
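
The hazard fixed above is reading an element's link after the element
has been freed. A minimal sketch of the safe pattern, written with a
hand-rolled list and hypothetical demo_ names instead of the queue(3)
macros, so that it builds anywhere:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical singly linked node; stands in for an entry on a list. */
struct demo_node {
	int			 value;
	struct demo_node	*next;
};

/* Push a node on the front; returns the old head if malloc() fails. */
struct demo_node *
demo_push(struct demo_node *head, int value)
{
	struct demo_node *n = malloc(sizeof(*n));

	if (n == NULL)
		return (head);
	n->value = value;
	n->next = head;
	return (n);
}

/* Free every node and return how many were freed. */
size_t
demo_drop_all(struct demo_node **headp)
{
	struct demo_node *n, *tmp;
	size_t freed = 0;

	for (n = *headp; n != NULL; n = tmp) {
		tmp = n->next;	/* save the link before n goes away */
		free(n);
		freed++;
	}
	*headp = NULL;
	return (freed);
}
```

A plain "n = n->next" in the loop header would dereference freed memory;
saving the successor first is what the _SAFE iteration variants do.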


# a29b4f6e 14-Apr-2006 John Baldwin <jhb@FreeBSD.org>

Drop the kqueue global mutex as soon as we are finished with it rather
than keeping it locked until we exit the function to optimize the case
where the lock would be dropped and later reacquired. The optimization
was broken when kevent's were moved from UFS to VFS and the knote list
lock for a vnode kevent became the lockmgr vnode lock. If one tried
to use a kqueue that contained events for a kqueue fd followed by a vnode,
then the kq global lock would end up being held when the vnode lock was
acquired which could result in sleeping with a mutex held (and subsequent
panics) if the vnode lock was contested.

Reviewed by: jmg
Tested by: ps (on 6.x)
MFC after: 3 days


# 1c4ca5e5 07-Apr-2006 John-Mark Gurney <jmg@FreeBSD.org>

spell unlock correctly, this is relatively minor as it's rare someone would
provide a lock method, and want the default unlock, but it is a bug...

PR: 95356
Submitted by: Stephen Corteselli
MFC after: 3 days


# 5e612589 01-Apr-2006 John-Mark Gurney <jmg@FreeBSD.org>

mask out any action when copying the flags from the event to the knote..

Pointed out by: Václav Haisman
Submitted by: Dan Nelson (slightly modified patch)
MFC after: 3 days
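
A sketch of the masking described above, under stated assumptions: the
DEMO_ constants are local illustrative definitions, and
demo_knote_flags() is a hypothetical helper, not the code that was
committed. The idea is that action bits describe what to do with a
registration and should not be stored as part of its state.

```c
#include <stdint.h>

/* Illustrative flag bits, defined locally for this sketch. */
#define DEMO_EV_ADD	0x0001u
#define DEMO_EV_DELETE	0x0002u
#define DEMO_EV_ENABLE	0x0004u
#define DEMO_EV_DISABLE	0x0008u
#define DEMO_EV_ONESHOT	0x0010u
#define DEMO_EV_CLEAR	0x0020u

/* Bits that request an action rather than describe behaviour. */
#define DEMO_EV_ACTIONS \
	(DEMO_EV_ADD | DEMO_EV_DELETE | DEMO_EV_ENABLE | DEMO_EV_DISABLE)

/* Return the flags that would be kept on the stored note. */
uint16_t
demo_knote_flags(uint16_t evflags)
{
	return ((uint16_t)(evflags & ~DEMO_EV_ACTIONS));
}
```

For example, a request carrying DEMO_EV_ADD | DEMO_EV_CLEAR leaves only
DEMO_EV_CLEAR on the stored note.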


# 4e095bc0 29-Mar-2006 John-Mark Gurney <jmg@FreeBSD.org>

hold the list lock over the f_event and KNOTE_ACTIVATE calls... This closes
a race where data could come in before we clear the INFLUX flag, and get
skipped over by knote (and hence never be activated, though it should of
been)...

Found by: glebius & co.
Reviewed by: glebius
MFC after: 3 days


# 69cd28da 12-Oct-2005 Doug Ambrisko <ambrisko@FreeBSD.org>

Add in kqueue support to LIO event notification and fix how it handled
notifications when LIO operations completed. These were the problems
with LIO event complete notification:
- Move all LIO/AIO event notification into one general function
so we don't have bugs in different data paths. This unification
got rid of several notification bugs one of which if kqueue was
used a SIGILL could get sent to the process.
- Change the LIO event accounting to count all AIO request that
could have been split across the fast path and daemon mode.
The prior accounting only kept track of AIO op's in that
mode and not the entire list of operations. This could cause
a bogus LIO event complete notification to occur when all of
the fast path AIO op's completed and not the AIO op's that
ended up queued for the daemon.

Suggestions from: alc


# 19b2dff7 15-Sep-2005 Stephan Uphoff <ups@FreeBSD.org>

Fix race condition that caused activation of an event to
be ignored immediately after it was deactivated.

Found by: Yahoo!
MFC after: 3 days


# 571dcd15 01-Jul-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone


# efe5beca 03-Jun-2005 Paul Saab <ps@FreeBSD.org>

Wrap copyin/copyout for kevent so the 32bit wrapper does not have
to malloc nchanges * sizeof(struct kevent) AND/OR nevents *
sizeof(struct kevent) on every syscall.

Glanced at by: peter, jmg
Obtained from: Yahoo!
MFC after: 2 weeks


# b633f50d 24-May-2005 John-Mark Gurney <jmg@FreeBSD.org>

make stat return a zero'd struct, and be a FIFO again... This is only
to fix libc_r since it requires stat to close fd's, and so commented in
the code...

PR: threads/75795
Reviewed by: ps
MFC after: 1 week


# c4c44d29 17-Mar-2005 John-Mark Gurney <jmg@FreeBSD.org>

fix aio+kq... I've been running ambrisko's test program for much longer
w/o problems than I was before... This simply brings back the knote_delete
as knlist_delete which will also drop the knote's, instead of just clearing
the list and seeing _ONESHOT...

Fix a race where if a note was _INFLUX and _DETACHED, it could end up being
modified... whoopse..

MFC after: 1 week
Prodded by: ambrisko and dwhite


# b8a4edc1 01-Mar-2005 Paul Saab <ps@FreeBSD.org>

Use kern_kevent instead of the stackgap for 32bit syscall wrapping.

Submitted by: jhb
Tested on: amd64


# 4f7fd28e 22-Feb-2005 Robert Watson <rwatson@FreeBSD.org>

When invoking callout_init(), spell '1' as "CALLOUT_MPSAFE".

MFC after: 3 days


# c711aea6 09-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Make a bunch of malloc types static.

Found by: src/tools/tools/kernxref


# 1b5cd47a 16-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move a FILEDESC_UNLOCK upwards to silence witness.


# 124e4c3b 13-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST.

Use this in all the places where sleeping with the lock held is not
an issue.

The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.


# 583ef6b6 13-Oct-2004 John-Mark Gurney <jmg@FreeBSD.org>

/me gets the wrong patch out of the pr :(
/me had the right patch w/o comments on his test system.

Pointed out by: kuriyama and ache
Pointy hat to: jmg


# d46316e8 13-Oct-2004 John-Mark Gurney <jmg@FreeBSD.org>

fix a bug where signal events didn't set the flags for attach/detach..

PR: 72234
MFC after: 2 days


# 31580e68 14-Sep-2004 John-Mark Gurney <jmg@FreeBSD.org>

unlock global lock in kqueue_scan before msleep'ing to prevent
deadlock.. we didn't unlock global lock earlier to prevent just having
to reacquire it again..

Found by: peter
Reviewed by: ps
MFC after: 3 days


# ca95b2de 09-Sep-2004 John-Mark Gurney <jmg@FreeBSD.org>

remove giant required from kqueue_close..

Reported by: kuriyama
MFC after: 3 days


# 9b90387d 06-Sep-2004 John-Mark Gurney <jmg@FreeBSD.org>

don't call f_detach if the filter has already removed the knote.. This
happens when a proc exits, but needs to inform the user that this has
happened.. This also means we can remove the check for detached from
proc and sig f_detach functions as this is done in kqueue now...

MFC after: 5 days


# 1c0f9af5 15-Aug-2004 Brian Feldman <green@FreeBSD.org>

Allocate the marker, when scanning a kqueue, from the "heap" instead of the
stack. When swapped out, a process's kernel stack would be unavailable,
and we could get a page fault when scanning the same kqueue.

PR: kern/61849


# ad3b9257 15-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowledge of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using its filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
acquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 7d5e45a3 13-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

looks like rwatson forgot tabs... :)


# 44f31f75 12-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Trim trailing white space.


# a6719c82 22-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Push Giant acquisition down into fo_stat() from most callers. Acquire
Giant conditional on debug.mpsafenet in the socket soo_stat() routine,
unconditionally in vn_statfile() for VFS, and otherwise don't acquire
Giant. Accept an unlocked read in kqueue_stat(), and cryptof_stat() is
a no-op. Don't acquire Giant in fstat() system call.

Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS
stack sitting on top, and therefore there will still be Giant recursion
in this case.


# 1c1ce925 22-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Push acquisition of Giant from fdrop_closed() into fo_close() so that
individual file object implementations can optionally acquire Giant if
they require it:

- soo_close(): depends on debug.mpsafenet
- pipe_close(): Giant not acquired
- kqueue_close(): Giant required
- vn_close(): Giant required
- cryptof_close(): Giant required (conservative)

Notes:

Giant is still acquired in close() even when closing MPSAFE objects
due to kqueue requiring Giant in the calling closef() code.
Microbenchmarks indicate that this removal of Giant cuts 3%-3% off
of pipe create/destroy pairs from user space with SMP compiled into
the kernel.

The cryptodev and opencrypto code appears MPSAFE, but I'm unable to
test it extensively and so have left Giant over fo_close(). It can
probably be removed given some testing and review.


# a88295bb 14-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Disable SIGIO for now, leave a comment as to why it's busted and hard
to fix.


# 67543ab1 14-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Make FIOASYNC, FIOSETOWN and FIOGETOWN work on kqueues.


# 94ed9c8a 04-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a new kevent filter. EVFILT_FS that will be used to signal
generic filesystem events to userspace. Currently only mount and unmount
of filesystems are signalled. Soon to be added, up/down status of NFS.

Introduce a sysctl node used to route requests to/from filesystems
based on filesystem ids.

Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/
entrypoint by the sysctl code to change individual filesystems.


# 948a4734 01-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Add GIANT_REQUIRED to kqueue_close(), since kqueue currently requires
Giant.


# ec513ff7 06-Apr-2004 Colin Percival <cperciva@FreeBSD.org>

Fix filt_timer* races: Finish initializing a knote before we pass it to
a callout, and use the new callout_drain API to make sure that a callout
has finished before we deallocate memory it is using.

PR: kern/64121
Discussed with: gallatin


# fe5f3a72 19-Feb-2004 Brian Feldman <green@FreeBSD.org>

Make sure to wake up any select waiters when closing a kqueue (also, not
crash). I am fairly sure that only people with SMP and multi-threaded
apps using kqueue will be affected by this, so I have a stress-testing
program on my web site:
<URL:http://green.homeunix.org/~green/getaddrinfo-pthreads-stresstest.c>


# 1c58509c 25-Dec-2003 David Malone <dwmalone@FreeBSD.org>

Don't TAILQ_INIT kq_head twice, once is enough.


# 1a29c806 14-Nov-2003 Olivier Houchard <cognet@FreeBSD.org>

Better fix than my previous commit:
in exit1(), make sure the p_klist is empty after sending NOTE_EXIT.
The process won't report fork() or execve() and won't be able to handle
NOTE_SIGNAL knotes anyway.
This fixes some race conditions with do_tdsignal() calling knote() while
the process is exiting.

Reported by: Stefan Farfeleder <stefan@fafoe.narf.at>
MFC after: 1 week


# 512824f8 09-Nov-2003 Seigo Tanimura <tanimura@FreeBSD.org>

- Implement selwakeuppri() which allows raising the priority of a
thread being woken up. The thread woken up can run at a priority as
high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.

Not objected in: -arch, -current


# 7922cdc8 03-Nov-2003 Olivier Houchard <cognet@FreeBSD.org>

I believe kbyanc@ really meant this in rev 1.58.
Use zpfind() to see if the process became a zombie if pfind() doesn't find it
and if the caller wants to know about process death, so that the caller knows
the process died even if it happened before the kevent was actually registered.

MFC after: 1 week


# f4400469 03-Nov-2003 Olivier Houchard <cognet@FreeBSD.org>

Do not attempt to report proc event if NOTE_EXIT has already been received.
This fixes a race condition (specifically with signal events) that could
lead to the kn being re-inserted into the list after it has been destroyed,
which is not something we want to happen.

PR: kern/58258


# e1419c08 19-Oct-2003 David Malone <dwmalone@FreeBSD.org>

falloc allocates a file structure and adds it to the file descriptor
table, acquiring the necessary locks as it works. It usually returns
two references to the new descriptor: one in the descriptor table
and one via a pointer argument.

As falloc releases the FILEDESC lock before returning, there is a
potential for a process to close the reference in the file descriptor
table before falloc's caller gets to use the file. I don't think this
can happen in practice at the moment, because Giant indirectly protects
closes.

To stop the file being completely closed in this situation, this change
makes falloc set the refcount to two when both references are returned.
This makes life easier for several of falloc's callers, because the
first thing they previously did was grab an extra reference on the
file.

Reviewed by: iedowse
Idea run past: jhb


# 7c2d2efd 18-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize struct fileops with C99 sparse initialization.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# f563420e 11-Apr-2003 Kelly Yancey <kbyanc@FreeBSD.org>

Fix race between a process registering a NOTE_EXIT EVFILT_PROC event and
the target process exiting which causes attempts to register the kevent
to randomly fail depending on whether the target runs to completion before
the parent can call kevent(2). The bug actually affects EVFILT_PROC
events on any zombie process, but the most common manifestation is with
parents trying to monitor child processes.

MFC after: 2 weeks
Sponsored by: NTT Multimedia Communications Labs


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# e7d6662f 14-Feb-2003 Alfred Perlstein <alfred@FreeBSD.org>

Do not allow kqueues to be passed via unix domain sockets.


# edf6699a 14-Feb-2003 Alfred Perlstein <alfred@FreeBSD.org>

Fix LOR with PROC/filedesc. Introduce fdesc_mtx that will be used as a
barrier between free'ing filedesc structures. Basically if you want to
access another process's filedesc, you want to hold this mutex over the
entire operation.


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 34c54d9f 20-Jan-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Rewrite the SMP filedesc locking in knote_attach() in order to
1. eliminate unnecessary loop which frees and re-allocates
the just allocated array
2. eliminate the newsize recomputation
3. eliminate unnecessary unlock and relock around free
4. correctly match the free with the malloc into M_KQUEUE instead of M_TEMP
5. eliminate conditional assignment of oldlist, which is equivalent to a
simple assignment
6. eliminate the oldlist temporary variable completely

Reviewed by: jhb


# 48e3128b 12-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


# cd72f218 11-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


# 13438f68 31-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

When compiling the kernel do not implicitly include filedesc.h from proc.h,
this was causing filedesc work to be very painful.
In order to make this work split out sigio definitions to their own header
(sigio.h) which is included from proc.h for the time being.


# a7010ee2 24-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

White-space changes.


# f3a68211 23-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Detediousficate declaration of fileops array members by introducing
typedefs for them.


# 9a1b076a 29-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Minor comment typo fix.

Submitted by: Wayne Morrison <tewok@tislabs.com>


# cb81d3ca 03-Oct-2002 Don Lewis <truckman@FreeBSD.org>

hashinit() calls MALLOC(), so release the filedesc lock in knote_attach()
before calling hashinit() and relock afterwards, taking care to see that
we don't lose a race.
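
The unlock, allocate, relock, and re-check pattern described above can be sketched in userland C. This is only an illustrative model under stated assumptions: struct fd_model and ensure_hash() are hypothetical names, a pthread mutex stands in for the filedesc lock, and calloc() stands in for hashinit(); none of this is the kernel code.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical model of a filedesc with a lazily allocated hash table. */
struct fd_model {
	pthread_mutex_t lock;
	int *hash;		/* NULL until first use */
};

/*
 * Called with fd->lock held; returns with fd->lock held.
 * Returns 0 when a table is installed, -1 on allocation failure.
 */
static int
ensure_hash(struct fd_model *fd, size_t nbuckets)
{
	int *fresh;

	if (fd->hash != NULL)
		return (0);

	/* The allocator may sleep, so the lock is dropped around it. */
	pthread_mutex_unlock(&fd->lock);
	fresh = calloc(nbuckets, sizeof(*fresh));
	pthread_mutex_lock(&fd->lock);

	/* Re-check: another thread may have installed a table meanwhile. */
	if (fd->hash != NULL) {
		free(fresh);	/* lost the race; free(NULL) is harmless */
		return (0);
	}
	if (fresh == NULL)
		return (-1);
	fd->hash = fresh;
	return (0);
}
```

The re-check after relocking is the important step: without it, two threads could each install a table and one allocation would be leaked.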


# d49fa1ca 16-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

In continuation of early fileop credential changes, modify fo_ioctl() to
accept an 'active_cred' argument reflecting the credential of the thread
initiating the ioctl operation.

- Change fo_ioctl() to accept active_cred; change consumers of the
fo_ioctl() interface to generally pass active_cred from td->td_ucred.
- In fifofs, initialize filetmp.f_cred to ap->a_cred so that the
invocations of soo_ioctl() are provided access to the calling f_cred.
Pass ap->a_td->td_ucred as the active_cred, but note that this is
required because we don't yet distinguish file_cred and active_cred
in invoking VOP's.
- Update kqueue_ioctl() for its new argument.
- Update pipe_ioctl() for its new argument, pass active_cred rather
than td_ucred to MAC for authorization.
- Update soo_ioctl() for its new argument.
- Update vn_ioctl() for its new argument, use active_cred rather than
td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 49cde51d 16-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Correct white space nits that crept in during my recent merges of
trustedbsd_mac material.


# ea6027a8 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): modify arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintain current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 9ca43589 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to clarify which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
- pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.
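
The credential selection described for vn_rdwr() can be sketched as a small pure function. This is a minimal model, not the kernel source: struct ucred_model and select_vop_cred() are hypothetical names, and a NULL file_cred is assumed to play the role of NOCRED.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct ucred. */
struct ucred_model {
	int uid;
};

/*
 * Pick the credential used to authorize the read/write VOP:
 * the file credential when the caller supplied one, otherwise
 * the active credential. MAC authorization (not modeled here)
 * would always use active_cred.
 */
static const struct ucred_model *
select_vop_cred(const struct ucred_model *active_cred,
    const struct ucred_model *file_cred)
{
	return (file_cred != NULL ? file_cred : active_cred);
}
```

A caller working on behalf of a struct file would pass both credentials; a caller with no file context would pass NULL for file_cred and fall back to the active credential.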

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 7f05b035 28-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t.


# 80208239 28-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

More caddr_t removal.
Change struct knote's kn_hook from caddr_t to void *.


# f44d9e24 18-May-2002 John Baldwin <jhb@FreeBSD.org>

Change p_can{debug,see,sched,signal}()'s first argument to be a thread
pointer instead of a proc pointer and require the process pointed to
by the second argument to be locked. We now use the thread ucred reference
for the credential checks in p_can*() as a result. p_canfoo() should now
no longer need Giant.


# c897b813 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# cd75bfa7 24-Jan-2002 Jonathan Lemon <jlemon@FreeBSD.org>

Add entry for EVFILT_NETDEV, which was inadvertently omitted back in Sept.


# a4db4953 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

Replace ffind_* with fget calls.

Make fget MPsafe.

Make fgetvp and fgetsock use the fget subsystem to reduce code bloat.

Push giant down in fpathconf().


# 426da3bc 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to be locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# 21d56e9c 29-Dec-2001 Alfred Perlstein <alfred@FreeBSD.org>

Make AIO a loadable module.

Remove the explicit call to aio_proc_rundown() from exit1(), instead AIO
will use at_exit(9).

Add functions at_exec(9), rm_at_exec(9) which function nearly the
same as at_exit(9) and rm_at_exit(9); these functions are called
on behalf of modules at the time of execve(2) after the image
activator has run.

Use a modified version of tegge's suggestion via at_exec(9) to close
an exploitable race in AIO.

Fix SYSCALL_MODULE_HELPER such that it's architecturally neutral,
the problem was that one had to pass it a parameter indicating the
number of arguments which were actually the number of "int". Fix
it by using an inline version of the AS macro against the syscall
arguments. (AS should be available globally but we'll get to that
later.)

Add a primitive system for dynamically adding kqueue ops, it's really
not as sophisticated as it should be, but I'll discuss with jlemon when
he's around.


# b064d43d 13-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

remove holdfp()

Replace uses of holdfp() with fget*() or fgetvp*() calls as appropriate

introduce fget(), fget_read(), fget_write() - these functions will take
a thread and file descriptor and return a file pointer with its ref
count bumped.

introduce fgetvp(), fgetvp_read(), fgetvp_write() - these functions will
take a thread and file descriptor and return a vref()'d vnode.

*_read() requires that the file pointer be FREAD, *_write that it be
FWRITE.
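
The access-mode check performed by the *_read() and *_write() variants can be illustrated with a small model. Everything here is an assumption made for illustration: struct file_model, fget_model(), and the MODEL_FREAD/MODEL_FWRITE constants are hypothetical, and returning EBADF on a mode mismatch is assumed rather than taken from this commit.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_FREAD	0x1	/* hypothetical stand-in for FREAD */
#define MODEL_FWRITE	0x2	/* hypothetical stand-in for FWRITE */

/* Hypothetical stand-in for struct file. */
struct file_model {
	int f_flag;	/* open mode bits */
	int f_count;	/* reference count */
};

/*
 * Look up fd in table, require the open-mode bits in 'need'
 * (0 means no requirement), and bump the reference count.
 * On success *fpp holds the referenced file; on failure it is NULL.
 */
static int
fget_model(struct file_model *table[], int nfiles, int fd, int need,
    struct file_model **fpp)
{
	struct file_model *fp;

	*fpp = NULL;
	if (fd < 0 || fd >= nfiles || (fp = table[fd]) == NULL)
		return (EBADF);
	if (need != 0 && (fp->f_flag & need) == 0)
		return (EBADF);
	fp->f_count++;
	*fpp = fp;
	return (0);
}
```

In this model a plain lookup passes need = 0, a read-style lookup passes MODEL_FREAD, and a write-style lookup passes MODEL_FWRITE; the reference is taken only after every check has passed.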

This continues the cleanup of struct filedesc and struct file access
routines which, when we are all through with it, will allow us to then
make the API calls MP safe and be able to move Giant down into the fo_*
functions.


# 0217f5c7 29-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Have EVFILT_TIMERS allocate their callouts via malloc() instead of using
the static callout list allocated by the system.

Change malloc type from M_TEMP to M_KQUEUE to better track memory.

Add a kern.kq_calloutmax to globally limit the amount of kernel memory
that can be allocated by callouts.

Submitted by: iedowse (items 1, 2)


# ed01445d 21-Sep-2001 John Baldwin <jhb@FreeBSD.org>

Use the passed in thread to selrecord() instead of curthread.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 116734c4 31-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Pushdown Giant for acct(), kqueue(), kevent(), execve(), fork(),
vfork(), rfork(), jail().


# 5f5c2e95 19-Jul-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce EVFILT_TIMER, which allows a process to establish an
arbitrary number of timers, both oneshot and periodic.

Repeatedly reminded to commit by: jayanth
Reviewed by: peter (a while back)


# a0f75161 05-Jul-2001 Robert Watson <rwatson@FreeBSD.org>

o Replace calls to p_can(..., P_CAN_xxx) with calls to p_canxxx().
The p_can(...) construct was a premature (and, it turns out,
awkward) abstraction. The individual calls to p_canxxx() better
reflect differences between the inter-process authorization checks,
such as differing checks based on the type of signal. This has
a side effect of improving code readability.
o Replace direct credential authorization checks in ktrace() with
invocation of p_candebug(), while maintaining the special case
check of KTR_ROOT. This allows ktrace() to "play more nicely"
with new mandatory access control schemes, as well as making its
authorization checks consistent with other "debugging class"
checks.
o Eliminate "privused" construct for p_can*() calls which allowed the
caller to determine if privilege was required for successful
evaluation of the access control check. This primitive is currently
unused, and as such, serves only to complicate the API.

Approved by: ({procfs,linprocfs} changes) des
Obtained from: TrustedBSD Project


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 33a9ed9d 23-Apr-2001 John Baldwin <jhb@FreeBSD.org>

Change the pfind() and zpfind() functions to lock the process that they
find before releasing the allproc lock and returning.

Reviewed by: -smp, dfr, jake


# e386f9bd 12-Apr-2001 Robert Watson <rwatson@FreeBSD.org>

o Make kqueue's filt_procattach() function use the error value returned
by p_can(...P_CAN_SEE), rather than returning EACCES directly. This
brings the error code used here into line with similar arrangements
elsewhere, and prevents the leakage of pid usage information.

Reviewed by: jlemon
Obtained from: TrustedBSD Project


# 24607d88 23-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Add an EV_SET() convenience macro for initializing struct kevent prior
to the call to kevent().

Update the copyright notices as well.


# 89bbe051 23-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Fix typo in comment (knode -> knote).


# 608a3ce6 15-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Extend kqueue down to the device layer.

Backwards compatible approach suggested by: peter


# e5690aad 23-Jan-2001 John Baldwin <jhb@FreeBSD.org>

Proc locking.


# 0a2c3d48 08-Jan-2001 Garrett Wollman <wollman@FreeBSD.org>

select() DKI is now in <sys/selinfo.h>.


# 7cc0979f 08-Dec-2000 David Malone <dwmalone@FreeBSD.org>

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 279d7226 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


# 387d2c03 29-Aug-2000 Robert Watson <rwatson@FreeBSD.org>

o Centralize inter-process access control, introducing:

int p_can(p1, p2, operation, privused)

which allows specification of subject process, object process,
inter-process operation, and an optional call-by-reference privused
flag, allowing the caller to determine if privilege was required
for the call to succeed. This allows jail, kern.ps_showallprocs and
regular credential-based interaction checks to occur in one block of
code. Possible operations are P_CAN_SEE, P_CAN_SCHED, P_CAN_KILL,
and P_CAN_DEBUG. p_can currently breaks out as a wrapper to a
series of static function checks in kern_prot, which should not
be invoked directly.

o Commented out capabilities entries are included for some checks.

o Update most inter-process authorization to make use of p_can() instead
of manual checks, PRISON_CHECK(), P_TRESPASS(), and
kern.ps_showallprocs.

o Modify suser{,_xxx} to use const arguments, as it no longer modifies
process flags due to the disabling of ASU.

o Modify some checks/errors in procfs so that ENOENT is returned instead
of ESRCH, further improving concealment of processes that should not
be visible to other processes. Also introduce new access checks to
improve hiding of processes for procfs_lookup(), procfs_getattr(),
procfs_readdir(). Correct a bug reported by bp concerning not
handling the CREATE case in procfs_lookup(). Remove volatile flag in
procfs that caused apparently spurious qualifier warnings (approved by
bde).

o Add comment noting that ktrace() has not been updated, as its access
control checks are different from ptrace(), whereas they should
probably be the same. Further discussion should happen on this topic.

Reviewed by: bde, green, phk, freebsd-security, others
Approved by: bde
Obtained from: TrustedBSD Project


# ad91b6a2 07-Aug-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Fix bug with timeout; previously, when attempting to poll the kqueue by
passing a zero-valued timeout, the code would always sleep for one tick.
Change code to avoid calling tsleep if we have no intention of sleeping.
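
The distinction this fix draws between polling and sleeping can be sketched as a classification of the caller's timeout. This is a hedged userland model only: enum kq_wait and classify_timeout() are hypothetical names, and the mapping (NULL waits indefinitely, a zero-valued timespec polls) follows the documented kevent(2) timeout semantics rather than the kernel code itself.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical classification of a kevent()-style timeout argument. */
enum kq_wait {
	KQ_WAIT_FOREVER,	/* no timeout supplied: block until an event */
	KQ_WAIT_POLL,		/* zero-valued timeout: never sleep */
	KQ_WAIT_TIMED		/* positive timeout: sleep at most that long */
};

static enum kq_wait
classify_timeout(const struct timespec *ts)
{
	if (ts == NULL)
		return (KQ_WAIT_FOREVER);
	if (ts->tv_sec == 0 && ts->tv_nsec == 0)
		return (KQ_WAIT_POLL);
	return (KQ_WAIT_TIMED);
}
```

With the poll case identified up front, the sleep call can be skipped entirely instead of being issued with a minimum one-tick timeout.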

Bring in bugfix from sys_select.c, r1.60 which also applies here.

Modify error handling slightly; passing in an invalid fd will now result
in EBADF returned in the eventlist, while an attempt to change a knote
which does not exist will result in ENOENT being returned. Previously
such attempts would fail silently without notification.

Pointed out by: nicolas.leonard@animaths.com
Rick Reed (rr@yahoo-inc.com)


# 1dfd4760 31-Jul-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Back out rev 1.12; it's not clear that this is the right thing to do,
and in any event, it wasn't done correctly in the first place.


# c828c7b7 28-Jul-2000 Peter Wemm <peter@FreeBSD.org>

Fix warnings - make kevent args in comment match those in syscalls.master.
Deal with consts.


# ab2adc20 27-Jul-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Have kevent() automatically restart if interrupted by a signal. If this
is not desired, then the user can register an EVFILT_SIGNAL filter to
explicitly catch a signal event.

Change requested by: jayanth, ps, peter
"Why is kevent non-restartable after a signal?"


# 2ba03123 18-Jul-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Fix a bug which would cause some knotes to get lost when two kqueues
were being used in a process at the same time.

Test case provided by: Chris Peiffer <peifferc@CS.Stanford.EDU>


# a8e65b91 18-Jul-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Simplify kqueue API slightly.

Discussed on: -arch


# 0e8363ec 28-Jun-2000 Chris Costello <chris@FreeBSD.org>

Report a file type (S_IFIFO) in kqueue_stat().


# d2693dbb 22-Jun-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Add code so that the udata field is preserved across a TRACK event.

When re-adding an event, do not reset the event state. If the event was
pending, it will remain pending. This allows the user to change the udata
field after the event was registered, while not losing any events which
have already occurred.

Reported by: jmg


# d36cb223 09-Jun-2000 Jonathan Lemon <jlemon@FreeBSD.org>

malloc(..., M_WAITOK) will not return NULL, so remove the error
handling for this case (which was slightly broken anyway)

Fix up some whitespace problems while I'm here too.

Submitted by: alfred (in a slightly different form)


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# a274d19b 21-May-2000 Brian Feldman <green@FreeBSD.org>

Back out NOTE_EXIT status reporting pending discussion.


# a24b514d 16-May-2000 Brian Feldman <green@FreeBSD.org>

Put the wait(2) exit status in "data" for NOTE_EXIT kevents.


# b4b03426 04-May-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Fix one bug where the kn_head list could be manipulated without
spl() protection in the case of a copyout error.

Add missing spl calls around the initial activation call that is
done when the kevent is added.

Add two KASSERT macros to help catch errors in the future.


# 3ee12e4f 16-Apr-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Add files that I forgot to `cvs add' on last commit.