History log of /openbsd-current/sys/uvm/uvm_mmap.c
Revision Date Author Comments
# 1.191 05-Apr-2024 deraadt

delete msyscall stub


# 1.190 05-Apr-2024 deraadt

On machines lacking xonly support hardware, we emulate xonly in the
copyin(9) layer below system calls, using a 4-entry lookup; the 4th
entry is libc.so text. We were assuming, or rather insisting, that
on all our architectures libc.so text is treated as xonly, even if
the linker was behind in its game.
Since msyscall(2) is gone, kernel no longer has information about the
start,len of libc.so text segment. But we can instead use the (same)
start,len range of pinsyscalls() for this purpose.
ld.so is passing the same text-range to the kernel in this position.
regression tests run by anton discovered that libc.so text had become
copyin-readable.
ok kettenis
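
A minimal sketch of the kind of 4-entry range lookup described above, written as a userspace model. The names (struct addr_range, in_xonly_range, NXONLY) are hypothetical and are not the kernel's; the real check lives in the copyin(9) path and may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

#define NXONLY 4	/* hypothetical: number of tracked xonly ranges */

struct addr_range {
	uintptr_t start;	/* first byte of the range */
	size_t len;		/* length in bytes, 0 = slot unused */
};

/*
 * Return 1 if the copy source [addr, addr+len) overlaps any registered
 * range, else 0.  Address wraparound is ignored in this sketch.
 */
int
in_xonly_range(const struct addr_range r[NXONLY], uintptr_t addr, size_t len)
{
	size_t i;

	if (len == 0)
		return 0;
	for (i = 0; i < NXONLY; i++) {
		if (r[i].len == 0)
			continue;
		/* half-open interval overlap test */
		if (addr < r[i].start + r[i].len && r[i].start < addr + len)
			return 1;
	}
	return 0;
}
```

With only four entries a linear scan is enough; a caller modelling the copyin path would refuse the copy when the function returns 1.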


# 1.189 05-Apr-2024 deraadt

Ensure the base,len range provided by ld.so is definitely in the map.
Being outside the map doesn't seem like it can do anything bad.
Discussed with kettenis


# 1.188 03-Apr-2024 kettenis

Stop grabbing the kernel lock in kbind(2).

ok mpi@


# 1.187 02-Apr-2024 deraadt

Delete the msyscall mechanism entirely, since mimmutable+pinsyscalls has
replaced it with a more strict mechanism, which happens to be lockless O(1)
rather than micro-lock O(1)+O(log N). Also nop-out the sys_msyscall(2) guts,
but leave the syscall around for a bit longer so that people can build through
it, since ld.so(1) still wants to call it.


# 1.186 28-Mar-2024 deraadt

Delete pinsyscall(2) [which was specific only to SYS_execve] now
that it has been replaced with pinsyscalls(2) [which tells the kernel
the location of all system calls in libc.so]
floated to various people before release, but it was prudent to wait.


Revision tags: OPENBSD_7_5_BASE
# 1.185 19-Jan-2024 deraadt

remove the guts of pinsyscall(2), it just returns 0 now.
It has been made redundant by the introduction of pinsyscalls(2) which
handles all system calls, rather than just 1.


# 1.184 16-Jan-2024 deraadt

The kernel will now read pinsyscall tables out of PT_OPENBSD_SYSCALLS in
the main program or ld.so, and accept a submission of that information
for libc.so from ld.so via pinsyscalls(2). At system call invocation,
the syscall number is matched to the specific address it must come from.
ok kettenis, gnezdo, testing of variations by many people
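
A simplified userspace model of the matching step described above, where each syscall number has exactly one permitted call-site address. The struct layout, NSYSCALLS and pin_check() are assumptions made for illustration, not the kernel's data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define NSYSCALLS 8	/* hypothetical table size for the sketch */

struct pin_table {
	uintptr_t base;			/* start of the registered text range */
	uint32_t offset[NSYSCALLS];	/* permitted call-site offset, 0 = none */
};

/*
 * Return 1 if syscall number sysno may be invoked from address pc,
 * else 0.  The lookup is a single array index, so it is O(1).
 */
int
pin_check(const struct pin_table *pt, unsigned int sysno, uintptr_t pc)
{
	if (sysno >= NSYSCALLS || pt->offset[sysno] == 0)
		return 0;
	return pc == pt->base + pt->offset[sysno];
}
```

Because the table is indexed directly by syscall number, the model needs no tree walk or lock once the table has been installed.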


# 1.183 07-Dec-2023 deraadt

Add a stub pinsyscalls() system call that simply returns 0 for now,
before future work where ld.so(1) will need this new system call.
Putting this in the kernel ahead of time will save some grief.
ok kettenis


Revision tags: OPENBSD_7_4_BASE
# 1.182 09-May-2023 kn

Inline once-used variable to sync all uvm_map_clean() callers

OK mpi


# 1.181 11-Apr-2023 jsg

fix double words in comments
feedback and ok jmc@ miod, ok millert@


Revision tags: OPENBSD_7_3_BASE
# 1.180 08-Mar-2023 guenther

Delete obsolete /* ARGSUSED */ lint comments.

ok miod@ millert@


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static program from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@
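
The effect of the explicit pad can be sketched with two struct layouts standing in for the argument slots. The struct and function names are hypothetical; the real syscall argument structures are generated from syscalls.master.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 64-bit offset directly after a 32-bit argument. */
struct args_unpadded {
	int32_t fd;
	int64_t offset;
};

/* Same arguments with an explicit 32-bit pad slot, as the old ABI used. */
struct args_padded {
	int32_t fd;
	int32_t pad;
	int64_t offset;
};

/* Byte offset of the 64-bit member in each layout. */
size_t
unpadded_offset(void)
{
	return offsetof(struct args_unpadded, offset);
}

size_t
padded_offset(void)
{
	return offsetof(struct args_padded, offset);
}
```

With the pad, the 64-bit member sits at offset 8 whether the ABI aligns int64_t to 4 or to 8 bytes. Without it, the offset depends on the ABI, which is the mismatch the pad argument was papering over.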


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parenthesis around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many
static-syscall-in-base-binary programs which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which means it is even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
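
The underflow can be sketched with plain size_t arithmetic. This is an illustration of the bug class, not the kernel code; the function names and the three-argument shape are assumptions.

```c
#include <stddef.h>

/*
 * Return 1 if charging grow more bytes would exceed limit, given that
 * used bytes are already charged.
 */
int
over_limit_naive(size_t used, size_t grow, size_t limit)
{
	/* BUG: limit - used wraps to a huge value when used > limit */
	return grow > limit - used;
}

int
over_limit_checked(size_t used, size_t grow, size_t limit)
{
	/* already over the limit: refuse before subtracting */
	if (used > limit)
		return 1;
	return grow > limit - used;
}
```

With used = 200 and limit = 100, the naive version computes an enormous remaining allowance and lets the request through, which matches the "ineffective when you already had more memory than your limit" behaviour described above.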


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- makes the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
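
The kind of check this entry centralizes can be sketched as below. This is an illustration only, written as a function rather than the macro the commit refers to; sk_align_range(), SK_PAGE_SIZE and SK_PAGE_MASK are hypothetical names, and the 4096-byte page size is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE	((size_t)4096)		/* assumed page size */
#define SK_PAGE_MASK	(SK_PAGE_SIZE - 1)

/*
 * Page-align an (addr, size) pair the way a mapping syscall might:
 * pull addr down to a page boundary, grow size by the same amount,
 * then round size up to a whole number of pages. Each step that
 * could wrap around is rejected with EINVAL and leaves the caller's
 * values untouched.
 */
int
sk_align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t a = *addr;
	size_t sz = *size;
	size_t pageoff = (size_t)(a & SK_PAGE_MASK);

	if (pageoff != 0) {
		if (sz > SIZE_MAX - pageoff)
			return EINVAL;	/* sz + pageoff would wrap */
		a -= pageoff;
		sz += pageoff;
	}
	if (sz > SIZE_MAX - SK_PAGE_MASK)
		return EINVAL;		/* rounding up would wrap */
	sz = (sz + SK_PAGE_MASK) & ~SK_PAGE_MASK;

	*addr = a;
	*size = sz;
	return 0;
}
```

Keeping the checks in one place means mmap, munmap, msync and friends cannot drift apart in how they treat a size near SIZE_MAX.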


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@
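
Rejecting undefined bits instead of ignoring them usually comes down to a mask test. A minimal sketch follows, assuming the set of valid protection bits is exactly PROT_READ, PROT_WRITE and PROT_EXEC; sk_check_prot() is a hypothetical name, not the function in the tree.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Return EINVAL if prot carries any bit outside the known set,
 * 0 otherwise. PROT_NONE (no bits set) is accepted.
 */
int
sk_check_prot(int prot)
{
	const int known = PROT_READ | PROT_WRITE | PROT_EXEC;

	if ((prot & ~known) != 0)
		return EINVAL;
	return 0;
}
```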


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


Revision tags: OPENBSD_7_5_BASE
# 1.185 19-Jan-2024 deraadt

remove the guts of pinsyscall(2), it just returns 0 now.
It has been made redundant by the introduction of pinsyscalls(2) which
handles all system calls, rather than just 1.


# 1.184 16-Jan-2024 deraadt

The kernel will now read pinsyscall tables out of PT_OPENBSD_SYSCALLS in
the main program or ld.so, and accept a submission of that information
for libc.so from ld.so via pinsyscalls(2). At system call invocation,
the syscall number is matched to the specific address it must come from.
ok kettenis, gnezdo, testing of variations by many people
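
The matching step described above can be pictured as a table lookup indexed by syscall number. The sketch below is purely illustrative: struct sk_pintab, its fields and sk_pin_ok() are hypothetical and do not reproduce the kernel's data structures; a stored offset of 0 is assumed to mean "not registered".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pin table for one text region. */
struct sk_pintab {
	uintptr_t	 base;	/* start of the text region */
	size_t		 len;	/* length of the text region */
	const uint32_t	*pins;	/* per-syscall offset of the call site */
	size_t		 npins;	/* number of entries in pins[] */
};

/*
 * Accept a system call only if it was issued from the single offset
 * registered for that syscall number. The check is one bounds test
 * and one array index, so it is O(1) and needs no lock for a table
 * that never changes after setup.
 */
bool
sk_pin_ok(const struct sk_pintab *t, size_t sysno, uintptr_t pc)
{
	if (pc < t->base || pc - t->base >= t->len)
		return false;		/* outside the region */
	if (sysno >= t->npins || t->pins[sysno] == 0)
		return false;		/* no call site registered */
	return pc - t->base == t->pins[sysno];
}
```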


# 1.183 07-Dec-2023 deraadt

Add a stub pinsyscalls() system call that simply returns 0 for now,
before future work where ld.so(1) will need this new system call.
Putting this in the kernel ahead of time will save some grief.
ok kettenis


Revision tags: OPENBSD_7_4_BASE
# 1.182 09-May-2023 kn

Inline once-used variable to sync all uvm_map_clean() callers

OK mpi


# 1.181 11-Apr-2023 jsg

fix double words in comments
feedback and ok jmc@ miod, ok millert@


Revision tags: OPENBSD_7_3_BASE
# 1.180 08-Mar-2023 guenther

Delete obsolete /* ARGSUSED */ lint comments.

ok miod@ millert@


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can then be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many
static-syscall-in-base-binary programs which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which means it is even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
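
As a rough illustration, a program that manages its own stack memory would
ask for it with MAP_STACK at mmap() time, since the message above says the
property cannot be added afterwards. The helper name stack_alloc() is
hypothetical, and the fallback define is an assumption for systems that lack
the flag:

```c
#define _DEFAULT_SOURCE 1	/* exposes MAP_ANON on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_STACK
#define MAP_STACK 0	/* assumption: flag unavailable, plain mapping */
#endif

/*
 * Hypothetical helper: allocate len bytes intended for use as a stack.
 * The flag has to be given here, at mapping time.
 * Returns NULL on failure.
 */
void *
stack_alloc(size_t len)
{
	void *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_STACK, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```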


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
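
The underflow described above can be sketched in isolation. This is not the
kernel's code: fits_under_limit() is a hypothetical helper and size_t stands
in for vsize_t, but it shows why the order of the comparisons matters for
unsigned types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: does a request of `size` bytes fit under `limit`
 * when `used` bytes are already in use? All values are unsigned.
 */
bool
fits_under_limit(size_t limit, size_t used, size_t size)
{
	/*
	 * A naive `size > limit - used` test breaks when used > limit:
	 * the subtraction wraps to a huge value and every request passes.
	 * Rule that case out first so the subtraction cannot wrap.
	 */
	if (used > limit)
		return false;
	return size <= limit - used;
}
```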


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend that most users use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from the caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program doesn't have the ability to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
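
A small userland probe of the behaviour described above. The function name
zero_length_mmap_errno() is hypothetical; it only reports what the running
system does with a zero-length request:

```c
#define _DEFAULT_SOURCE 1	/* exposes MAP_ANON on glibc in strict modes */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical probe: attempt a zero-length anonymous mapping and
 * return the resulting errno, or 0 if the call unexpectedly succeeded.
 */
int
zero_length_mmap_errno(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return errno;
	return 0;
}
```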


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
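
The kind of check being factored out can be sketched as follows. This is an
illustration only, not the macro from the commit: align_range() and the
SKETCH_* constants are hypothetical names, and a fixed 4096-byte page is
assumed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uintptr_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: page-align the range [*addr, *addr + *size).
 * Returns false, leaving the arguments untouched, if any step of the
 * arithmetic would wrap around.
 */
bool
align_range(uintptr_t *addr, uintptr_t *size)
{
	uintptr_t pageoff = *addr & SKETCH_PAGE_MASK;
	uintptr_t start = *addr - pageoff;
	uintptr_t len = *size;

	/* growing the length by the in-page offset must not wrap */
	if (len > UINTPTR_MAX - pageoff)
		return false;
	len += pageoff;

	/* rounding up to a whole page must not wrap */
	if (len > UINTPTR_MAX - SKETCH_PAGE_MASK)
		return false;
	len = (len + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;

	/* the end of the range must not wrap past the top of memory */
	if (start > UINTPTR_MAX - len)
		return false;

	*addr = start;
	*size = len;
	return true;
}
```

Doing the same sequence of checks for every syscall that takes an address
and a length is what keeps their behaviour consistent.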


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred-only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in it. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.184 16-Jan-2024 deraadt

The kernel will now read pinsyscall tables out of PT_OPENBSD_SYSCALLS in
the main program or ld.so, and accept a submission of that information
for libc.so from ld.so via pinsyscalls(2). At system call invocation,
the syscall number is matched to the specific address it must come from.
ok kettenis, gnezdo, testing of variations by many people
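
The matching step can be pictured with a toy model. This is a sketch under stated assumptions, not the kernel's data structure: struct toy_pins, toy_pin() and toy_pin_check() are hypothetical names, and the table layout (one offset per syscall number, relative to the start of a text range) is assumed for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NSYSCALLS	8	/* hypothetical table size */

/* Hypothetical per-process pin table: one permitted call site per syscall. */
struct toy_pins {
	uintptr_t base;				/* start of the text range */
	uint32_t  offset[TOY_NSYSCALLS];	/* 0 = syscall not pinned */
};

/* Register the single permitted call site for syscall `sysno`. */
int
toy_pin(struct toy_pins *pins, unsigned int sysno, uint32_t off)
{
	if (sysno >= TOY_NSYSCALLS || off == 0)
		return -1;
	pins->offset[sysno] = off;
	return 0;
}

/* O(1) check: does `pc` match the one address `sysno` may come from? */
int
toy_pin_check(const struct toy_pins *pins, unsigned int sysno, uintptr_t pc)
{
	if (sysno >= TOY_NSYSCALLS || pins->offset[sysno] == 0)
		return 0;
	return pc == pins->base + pins->offset[sysno];
}
```

The lookup is a single array index followed by one comparison, so it needs no search and no lock as long as the table is not modified after setup.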


# 1.183 07-Dec-2023 deraadt

Add a stub pinsyscalls() system call that simply returns 0 for now,
before future work where ld.so(1) will need this new system call.
Putting this in the kernel ahead of time will save some grief.
ok kettenis


Revision tags: OPENBSD_7_4_BASE
# 1.182 09-May-2023 kn

Inline once-used variable to sync all uvm_map_clean() callers

OK mpi


# 1.181 11-Apr-2023 jsg

fix double words in comments
feedback and ok jmc@ miod, ok millert@


Revision tags: OPENBSD_7_3_BASE
# 1.180 08-Mar-2023 guenther

Delete obsolete /* ARGSUSED */ lint comments.

ok miod@ millert@


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parenthesis around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
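
The reserve-then-populate pattern this entry refers to looks roughly like the following userland sketch. The helper name reserve_then_commit() is made up for illustration. Per the entry, the mprotect(2) call is the step that is now counted against RLIMIT_DATA and may therefore fail.

```c
#define _DEFAULT_SOURCE	/* MAP_ANON on glibc in strict modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve `npages` of address space with PROT_NONE, then make only the
 * first page usable. Returns 0 on success, -1 on failure.
 */
int
reserve_then_commit(size_t npages)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	size_t len;
	char *base;
	int rv = -1;

	if (pgsz <= 0 || npages == 0)
		return -1;
	len = npages * (size_t)pgsz;

	/* Reservation only: no access is permitted yet. */
	base = mmap(NULL, len, PROT_NONE, MAP_ANON | MAP_PRIVATE, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	/* Flipping away from PROT_NONE is the step that can fail. */
	if (mprotect(base, (size_t)pgsz, PROT_READ | PROT_WRITE) == 0) {
		base[0] = 1;		/* the committed page is writable */
		rv = base[0] == 1 ? 0 : -1;
	}

	munmap(base, len);
	return rv;
}
```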


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many
static-syscall-in-base-binary programs which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert
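
A minimal sketch of the RW/RX swap mentioned in this entry, as a JIT might use it. The helper name install_code_wx_safe() is made up for illustration, and the bytes are never executed. The page is writable or executable, never both at once.

```c
#define _DEFAULT_SOURCE	/* MAP_ANON on glibc in strict modes */
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Copy `len` bytes of generated code into a fresh page while it is
 * writable, then swap the page to read+execute. Returns 0 on success
 * or a negated errno value.
 */
int
install_code_wx_safe(const unsigned char *code, size_t len)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	void *p;
	int rv = 0;

	if (pgsz <= 0 || len == 0 || len > (size_t)pgsz)
		return -EINVAL;

	p = mmap(NULL, (size_t)pgsz, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return -errno;

	memcpy(p, code, len);		/* RW phase: fill in the code */

	/* RX phase: drop write permission before anything may execute. */
	if (mprotect(p, (size_t)pgsz, PROT_READ | PROT_EXEC) == -1)
		rv = -errno;

	munmap(p, (size_t)pgsz);
	return rv;
}
```

A real JIT would keep the mapping and swap back to RW whenever it needs to patch the code; this sketch unmaps immediately to stay self-contained.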


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@
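
A sketch of how a library might request concealed memory for key material. secret_alloc() and secret_free() are hypothetical helper names, and the fallback definition of MAP_CONCEAL as 0 is an assumption so that the sketch still builds on systems without the flag, where it degrades to a plain anonymous mapping.

```c
#define _DEFAULT_SOURCE	/* MAP_ANON on glibc in strict modes */
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_CONCEAL
#define MAP_CONCEAL 0	/* assumption: plain mapping where unsupported */
#endif

/* Allocate a buffer for secrets that should stay out of core dumps. */
void *
secret_alloc(size_t len)
{
	void *p;

	if (len == 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_CONCEAL, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/* Wipe and release a buffer obtained from secret_alloc(). */
void
secret_free(void *p, size_t len)
{
	volatile unsigned char *b = p;
	size_t i;

	if (p == NULL)
		return;
	for (i = 0; i < len; i++)	/* volatile stores are not elided */
		b[i] = 0;
	munmap(p, len);
}
```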


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
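
From userland, obtaining stack memory under this scheme could look like the sketch here. stack_alloc() is a hypothetical helper name, and the fallback definition of MAP_STACK as 0 is an assumption for platforms that lack the flag.

```c
#define _DEFAULT_SOURCE	/* MAP_ANON and MAP_STACK on glibc in strict modes */
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_STACK
#define MAP_STACK 0	/* assumption: platforms without the flag ignore it */
#endif

/*
 * Allocate memory suitable for use as a thread or signal stack.
 * A fresh anonymous mapping is zero-filled, which matches the entry's
 * point that MAP_STACK cannot be added to memory that already has
 * contents.
 */
void *
stack_alloc(size_t len)
{
	void *p;

	if (len == 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_STACK, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

Memory from malloc(3) would not carry the flag, which is why the entry notes that sigaltstack() and pthread_attr_setstack() were changed to create a MAP_STACK sub-region themselves.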


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
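
The class of bug can be shown in isolation. This is a sketch of unsigned underflow in a limit check, not the kernel's actual code: toy_vsize_t and both function names are made up for illustration, and the exact shape of the original comparison is assumed.

```c
#include <stddef.h>

typedef size_t toy_vsize_t;	/* stand-in for an unsigned vsize_t */

/* Broken: `limit - used` wraps around when used > limit. */
int
over_limit_broken(toy_vsize_t used, toy_vsize_t limit, toy_vsize_t want)
{
	return want > limit - used;
}

/* Fixed: test for the already-over-limit case before subtracting. */
int
over_limit_fixed(toy_vsize_t used, toy_vsize_t limit, toy_vsize_t want)
{
	return used > limit || want > limit - used;
}
```

With used = 200 and limit = 100, the broken version computes a huge remaining budget and lets the request through, which is the "ineffective when you already had more memory than your limit" case from the entry.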


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon PT_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend that most users use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function; remove
it from the caller.

- make proper use of (potential) returned error of _check() functions.

- add pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program cannot use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
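
The behaviour can be observed from userland with a short probe. The function name zero_length_mmap_errno() is made up for illustration.

```c
#define _DEFAULT_SOURCE	/* MAP_ANON on glibc in strict modes */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Return the errno from a zero-length anonymous mmap, or 0 if it succeeded. */
int
zero_length_mmap_errno(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return errno;
	return 0;
}
```

Callers that compute a length at run time should therefore treat zero as a separate case rather than passing it through to mmap(2).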


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.186 28-Mar-2024 deraadt

Delete pinsyscall(2) [which was specific only to SYS_execve] now
that it has been replaced with pinsyscalls(2) [which tells the kernel
the location of all system calls in libc.so]
floated to various people before release, but it was prudent to wait.


Revision tags: OPENBSD_7_5_BASE
# 1.185 19-Jan-2024 deraadt

remove the guts of pinsyscall(2), it just returns 0 now.
It has been made redundant by the introduction of pinsyscalls(2) which
handles all system calls, rather than just 1.


# 1.184 16-Jan-2024 deraadt

The kernel will now read pinsyscall tables out of PT_OPENBSD_SYSCALLS in
the main program or ld.so, and accept a submission of that information
for libc.so from ld.so via pinsyscalls(2). At system call invocation,
the syscall number is matched to the specific address it must come from.
ok kettenis, gnezdo, testing of variations by many people


# 1.183 07-Dec-2023 deraadt

Add a stub pinsyscalls() system call that simply returns 0 for now,
before future work where ld.so(1) will need this new system call.
Putting this in the kernel ahead of time will save some grief.
ok kettenis


Revision tags: OPENBSD_7_4_BASE
# 1.182 09-May-2023 kn

Inline once-used variable to sync all uvm_map_clean() callers

OK mpi


# 1.181 11-Apr-2023 jsg

fix double words in comments
feedback and ok jmc@ miod, ok millert@


Revision tags: OPENBSD_7_3_BASE
# 1.180 08-Mar-2023 guenther

Delete obsolete /* ARGSUSED */ lint comments.

ok miod@ millert@


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parenthesis around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
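
The rule described above can be sketched in userland C. This is only an
illustration, not the kernel's uvm_wxcheck(): the function name wx_check
and its two boolean parameters are hypothetical stand-ins for "binary is
marked wxneeded" and "filesystem is mounted wxallowed", and the
kern.wxabort path is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Hypothetical sketch: return ENOTSUP when a mapping asks for both
 * PROT_WRITE and PROT_EXEC, unless the binary is marked wxneeded AND
 * lives on a wxallowed filesystem.  Returns 0 when the request is fine.
 */
static int
wx_check(int prot, bool wxneeded, bool wxallowed)
{
	const int wx = PROT_WRITE | PROT_EXEC;

	if ((prot & wx) != wx)
		return 0;		/* not a W^X violation */
	if (wxneeded && wxallowed)
		return 0;		/* explicitly permitted */
	return ENOTSUP;
}
```

Both conditions must hold; a wxneeded binary on a filesystem without
wxallowed is still refused in this sketch.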


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
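
The underflow described above can be illustrated with a small sketch.
This is not the committed diff; the helper names and the use of size_t
in place of vsize_t are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: "limit - used" wraps around to a huge value when the
 * process already uses more than its limit, so the test never fires.
 */
static bool
exceeds_limit_naive(size_t limit, size_t used, size_t request)
{
	return request > limit - used;
}

/*
 * Guarded check: handle the "already over the limit" case first, so
 * the subtraction below can no longer underflow.
 */
static bool
exceeds_limit(size_t limit, size_t used, size_t request)
{
	if (used > limit)
		return true;
	return request > limit - used;
}
```

With a limit of 100 and 150 already in use, the naive form accepts a
further 1-byte request while the guarded form rejects it.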


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend that most users use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in lesse kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
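
A minimal sketch of the wrap-to-zero case, assuming a 4096-byte page
size and a hypothetical helper named round_len (neither is taken from
the kernel source): rounding a very large length up to a page boundary
can overflow and yield 0, which is reported as ENOMEM here, while
lengths with the high bit set are otherwise accepted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SIZE	((size_t)4096)	/* assumed page size */

/*
 * Round len up to a page multiple.  Returns 0 and stores the result in
 * *out on success, or ENOMEM when the rounding wrapped around to 0.
 */
static int
round_len(size_t len, size_t *out)
{
	size_t rounded = (len + EX_PAGE_SIZE - 1) & ~(EX_PAGE_SIZE - 1);

	if (len != 0 && rounded == 0)
		return ENOMEM;	/* size wrapped to 0 */
	*out = rounded;
	return 0;
}
```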


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.185 19-Jan-2024 deraadt

remove the guts of pinsyscall(2), it just returns 0 now.
It has been made redundant by the introduction of pinsyscalls(2) which
handles all system calls, rather than just 1.


# 1.184 16-Jan-2024 deraadt

The kernel will now read pinsyscall tables out of PT_OPENBSD_SYSCALLS in
the main program or ld.so, and accept a submission of that information
for libc.so from ld.so via pinsyscalls(2). At system call invocation,
the syscall number is matched to the specific address it must come from.
ok kettenis, gnezdo, testing of variations by many people


# 1.183 07-Dec-2023 deraadt

Add a stub pinsyscalls() system call that simply returns 0 for now,
before future work where ld.so(1) will need this new system call.
Putting this in the kernel ahead of time will save some grief.
ok kettenis


Revision tags: OPENBSD_7_4_BASE
# 1.182 09-May-2023 kn

Inline once-used variable to sync all uvm_map_clean() callers

OK mpi


# 1.181 11-Apr-2023 jsg

fix double words in comments
feedback and ok jmc@ miod, ok millert@


Revision tags: OPENBSD_7_3_BASE
# 1.180 08-Mar-2023 guenther

Delete obsolete /* ARGSUSED */ lint comments.

ok miod@ millert@


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@
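
Illustration: a deliberately simplified model of the slot problem told in
the story above, assuming 32-bit argument slots and counting only the
arguments that precede the off_t. The function names are hypothetical and
no real ABI is reproduced exactly.

```c
#include <assert.h>

/* Packed layout: the 64-bit argument goes in the very next free slot. */
unsigned
slot_packed(unsigned nwords)
{
	return nwords;
}

/* Aligned layout: the 64-bit argument must start on an even slot. */
unsigned
slot_aligned(unsigned nwords)
{
	return (nwords + 1u) & ~1u;
}

/* Returns 1 when both layouts put the off_t in the same slot. */
int
layouts_agree(unsigned nwords)
{
	assert(nwords < 16);	/* arbitrary bound for the toy model */
	return slot_packed(nwords) == slot_aligned(nwords);
}
```

In this model mmap has five word-sized arguments before its off_t, so
layouts_agree(5) is 0; adding the explicit pad makes it six and
layouts_agree(6) is 1, which is the effect the pad argument was meant to
guarantee.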


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
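
Illustration: a minimal user-space sketch of the reserve-then-commit pattern
this change is aimed at. The helper name reserve_then_commit() is
hypothetical; mmap(2), mprotect(2) and munmap(2) are the standard calls.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANON on glibc; harmless elsewhere */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve `reserve` bytes of address space with PROT_NONE, then make only
 * the first `commit` bytes accessible. Returns 0 on success, -1 on error.
 */
int
reserve_then_commit(size_t reserve, size_t commit)
{
	char *base;
	int rv = -1;

	assert(commit > 0 && commit <= reserve);
	base = mmap(NULL, reserve, PROT_NONE, MAP_ANON | MAP_PRIVATE, -1, 0);
	if (base == MAP_FAILED)
		return -1;
	/* With this change, this is the step charged against RLIMIT_DATA. */
	if (mprotect(base, commit, PROT_READ | PROT_WRITE) == 0) {
		base[0] = 1;	/* the committed part is now usable */
		rv = 0;
	}
	munmap(base, reserve);
	return rv;
}
```

A caller that used to check only the mmap() result now also needs to check
mprotect(), since that is where a limit failure would surface.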


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many
static-syscall-in-base-binary programs which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which means it is even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
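
Illustration: a minimal sketch of the underflow described above, assuming
the check compares the request against `limit - used` in an unsigned type.
The function names and toy_vsize_t are hypothetical; the real check lives in
the kernel and uses the process's rlimit and data usage.

```c
#include <stddef.h>

typedef size_t toy_vsize_t;	/* unsigned, like the kernel's vsize_t */

/* Buggy form: `limit - used` wraps to a huge value once used > limit. */
int
over_limit_buggy(toy_vsize_t limit, toy_vsize_t used, toy_vsize_t size)
{
	return size > limit - used;
}

/* Checked form: compare used against limit before subtracting. */
int
over_limit_checked(toy_vsize_t limit, toy_vsize_t used, toy_vsize_t size)
{
	if (used > limit)
		return 1;
	return size > limit - used;
}
```

With limit 100 and used 150, the buggy form lets a request of 10 through
because the subtraction wraps, while the checked form rejects it.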


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
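
Illustration: a minimal user-space sketch that observes the behaviour
described above. The helper name zero_length_mmap_errno() is hypothetical;
it only reports what the running system returns.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANON on glibc; harmless elsewhere */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Returns the errno from a zero-length anonymous mmap, or 0 if it succeeded. */
int
zero_length_mmap_errno(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE,
	    -1, 0);
	if (p == MAP_FAILED)
		return errno;
	/* Unexpected success: there is no sensible length to unmap. */
	return 0;
}
```

On a system that follows the POSIX rule the function is expected to return
EINVAL.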


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@
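
A minimal sketch of the kind of wraparound-safe limit check described above.
The helper name and its parameters are hypothetical and are not taken from
uvm_mmap.c. The naive test `used + request > limit` can wrap to a small
value and pass; comparing against `limit - request` avoids the addition.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if adding `request` bytes to the
 * `used` total would exceed `limit`, without ever computing
 * used + request (which could wrap around).
 */
static bool
sk_would_exceed_limit(size_t used, size_t request, size_t limit)
{
	if (request > limit)
		return true;		/* also keeps limit - request from wrapping */
	return used > limit - request;
}
```

With used = 10 and limit = 100, a request of SIZE_MAX is rejected here,
whereas the naive sum would wrap to 9 and be accepted.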


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
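
A sketch of how such a shared wrap check might look when factored into one
macro. All SKETCH_ and sketch_ names, and the 4096-byte page size, are
assumptions made for illustration; this is not the macro from the commit.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uintptr_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical macro: page-align (addr, size) in place. On integer
 * wraparound it returns EINVAL from the enclosing function.
 */
#define SKETCH_ALIGN_RANGE(addr, size) do { \
	uintptr_t off_ = (addr) & SKETCH_PAGE_MASK; \
	if (off_ != 0) { \
		if ((size) > SIZE_MAX - off_) \
			return EINVAL; \
		(addr) -= off_; \
		(size) += off_; \
	} \
	if ((size) != 0) { \
		(size) = ((size) + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK; \
		if ((size) == 0) \
			return EINVAL; \
	} \
} while (0)

/* Every syscall-like entry point can then share the same check. */
static int
sketch_check_range(uintptr_t addr, size_t size, uintptr_t *out_addr,
    size_t *out_size)
{
	SKETCH_ALIGN_RANGE(addr, size);
	*out_addr = addr;
	*out_size = size;
	return 0;
}
```

Keeping the checks in one place means mmap-, munmap- and msync-style callers
cannot drift apart in how they treat a size that rounds up to zero.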


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.184 16-Jan-2024 deraadt

The kernel will now read pinsyscall tables out of PT_OPENBSD_SYSCALLS in
the main program or ld.so, and accept a submission of that information
for libc.so from ld.so via pinsyscalls(2). At system call invocation,
the syscall number is matched to the specific address it must come from.
ok kettenis, gnezdo, testing of variations by many people
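
A toy model of the per-syscall address match described above. The struct
layout, the "0 means not pinned" convention and all sk_ names are assumptions
made for illustration; the kernel's real data structures are not shown in
this log.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pin table: one expected entry offset per syscall number. */
struct sk_pintable {
	uintptr_t base;		/* start of the text range holding the stubs */
	const uint32_t *pins;	/* pins[n] = offset for syscall n, 0 = not pinned */
	size_t npins;
};

/*
 * Returns true only if syscall `sysno` was invoked from exactly the
 * address registered for it. This is a single array lookup.
 */
static bool
sk_pin_ok(const struct sk_pintable *t, size_t sysno, uintptr_t pc)
{
	if (sysno >= t->npins || t->pins[sysno] == 0)
		return false;
	return pc == t->base + t->pins[sysno];
}
```

In this model a syscall instruction found elsewhere in the same text range
still fails the check, because the address is tied to the syscall number
rather than to the range as a whole.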


# 1.183 07-Dec-2023 deraadt

Add a stub pinsyscalls() system call that simply returns 0 for now,
before future work where ld.so(1) will need this new system call.
Putting this in the kernel ahead of time will save some grief.
ok kettenis


Revision tags: OPENBSD_7_4_BASE
# 1.182 09-May-2023 kn

Inline once-used variable to sync all uvm_map_clean() callers

OK mpi


# 1.181 11-Apr-2023 jsg

fix double words in comments
feedback and ok jmc@ miod, ok millert@


Revision tags: OPENBSD_7_3_BASE
# 1.180 08-Mar-2023 guenther

Delete obsolete /* ARGSUSED */ lint comments.

ok miod@ millert@


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@
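
The slot arithmetic behind the pad argument can be sketched as follows,
assuming a 32-bit ABI where each argument before the off_t takes one slot
and a 64-bit argument takes two. The helper names are hypothetical.

```c
/* Round a slot index up to the next even slot (no-op if already even). */
static unsigned
sk_even_slot(unsigned slot)
{
	return (slot + 1u) & ~1u;
}

/*
 * Number of pad slots needed so that a 64-bit argument following
 * `nargs_before` one-slot arguments starts on an even slot.
 */
static unsigned
sk_pad_slots(unsigned nargs_before)
{
	return sk_even_slot(nargs_before) - nargs_before;
}
```

For mmap, five one-slot arguments precede the offset, so slot 5 would be odd
and one pad slot moves the off_t to slots 6 and 7; for lseek, one argument
precedes it and the same single pad applies.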


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
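
A toy accounting model of the policy above. The sk_ types and names, the
ENOMEM return value, and the uncounting when a mapping goes back to
PROT_NONE are assumptions for illustration, not the kernel code.

```c
#include <errno.h>
#include <stddef.h>

#define SK_PROT_NONE	0
#define SK_PROT_READ	1

struct sk_space { size_t dused; size_t dlimit; };	/* pages counted, limit */
struct sk_entry { size_t npages; int prot; };

/* Map npages; PROT_NONE mappings are not counted against the limit. */
static int
sk_map(struct sk_space *vs, struct sk_entry *e, size_t npages, int prot)
{
	if (prot != SK_PROT_NONE) {
		if (npages > vs->dlimit - vs->dused)
			return ENOMEM;
		vs->dused += npages;
	}
	e->npages = npages;
	e->prot = prot;
	return 0;
}

/* Change protection; the limit is checked when leaving PROT_NONE. */
static int
sk_protect(struct sk_space *vs, struct sk_entry *e, int prot)
{
	if (e->prot == SK_PROT_NONE && prot != SK_PROT_NONE) {
		if (e->npages > vs->dlimit - vs->dused)
			return ENOMEM;
		vs->dused += e->npages;
	} else if (e->prot != SK_PROT_NONE && prot == SK_PROT_NONE) {
		vs->dused -= e->npages;	/* assumption: uncount on the way back */
	}
	e->prot = prot;
	return 0;
}
```

A large PROT_NONE reservation therefore succeeds at map time, and the limit
only bites when part of it is made accessible.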


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can then be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis
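
What "always return that memory is in core" amounts to can be sketched as
below. The helper name and signature are hypothetical; only the behaviour
(every page reported resident) comes from the message above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: fill the result vector with one "resident" byte
 * per page covered by `len`, regardless of the real page state.
 * Returns the number of entries written.
 */
static size_t
sk_mincore_fill(unsigned char *vec, size_t len, size_t pagesize)
{
	size_t npages = len / pagesize + (len % pagesize != 0);

	memset(vec, 1, npages);
	return npages;
}
```

Because the answer no longer depends on residency, a caller cannot use it to
observe which shared pages another process has touched.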


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
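
A toy version of the stack-pointer check described above: on a fault or
system call, look up the region containing the stack pointer and accept it
only if that region was mapped with MAP_STACK. The region list and all sk_
names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical map entry: [start, end) plus a MAP_STACK marker. */
struct sk_region {
	uintptr_t start;
	uintptr_t end;
	bool is_stack;
};

/* Returns true if sp lies inside a region marked as stack. */
static bool
sk_sp_ok(const struct sk_region *r, size_t n, uintptr_t sp)
{
	for (size_t i = 0; i < n; i++) {
		if (sp >= r[i].start && sp < r[i].end)
			return r[i].is_stack;
	}
	return false;	/* sp points at unmapped space */
}
```

In the real system a failed check leads to SIGSEGV; here the caller would
act on the boolean. A stack pivot into ordinary heap or data memory fails
because those regions never carry the marker.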


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
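
As an illustration only, the policy described in this entry can be sketched
as a small predicate. The function name and its two boolean parameters are
hypothetical stand-ins and this is not the kernel's uvm_wxcheck() code; only
PROT_WRITE, PROT_EXEC and ENOTSUP come from the standard headers.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Hypothetical sketch of the policy in this commit message.
 * wxneeded:  the binary was marked "wxneeded" (assumed input)
 * wxallowed: it was run from a "wxallowed" filesystem (assumed input)
 * Returns 0 if the request is acceptable, ENOTSUP otherwise.
 */
static int
wx_policy_sketch(int prot, bool wxneeded, bool wxallowed)
{
	const int wx = PROT_WRITE | PROT_EXEC;

	/* Only a request for both write and execute is a W^X violation. */
	if ((prot & wx) != wx)
		return 0;

	/* Violations are permitted only when both markings are present. */
	if (wxneeded && wxallowed)
		return 0;

	return ENOTSUP;
}
```

A caller of mmap(2) or mprotect(2) would see the ENOTSUP case as a failed
call with errno set, rather than as a signal, unless kern.wxabort is set.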


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
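
The class of bug described here can be sketched in a few lines. This is not
the kernel code: the function names are hypothetical, and size_t stands in
for the kernel's vsize_t. The point is only that an unsigned "limit - used"
wraps to a huge value once usage already exceeds the limit.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: if used > limit, the subtraction wraps around and the
 * comparison almost never fires, so the limit is silently ignored.
 */
static bool
exceeds_limit_naive(size_t limit, size_t used, size_t size)
{
	return size > limit - used;
}

/*
 * Checked version: test for "already over the limit" first, so the
 * subtraction is only evaluated when it cannot underflow.
 */
static bool
exceeds_limit_checked(size_t limit, size_t used, size_t size)
{
	return used > limit || size > limit - used;
}
```

With a limit of 100 and 200 already in use, the naive version accepts a
request of 50 while the checked version rejects it.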


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove
it from callers.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
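
From userland, the rule above can be observed with a short probe. This is a
sketch, not a regression test from the tree; the function name is made up
for illustration, and the feature-test macro is only there so that MAP_ANON
is visible on glibc-based systems.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* exposes MAP_ANON on glibc */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Ask for a zero-length anonymous mapping.
 * Returns the errno from the failed call, or 0 if the kernel
 * unexpectedly handed back a mapping.
 */
static int
mmap_zero_length_errno(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p != MAP_FAILED)
		return 0;	/* nothing sensible to unmap at length 0 */
	return errno;
}
```

On a system that follows the POSIX rule, the call fails and the function
returns EINVAL.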


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
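
One plausible shape for such a shared check, written as a function rather
than a macro so it can stand alone, is sketched below. The names and the
fixed 4096-byte page size are assumptions made for the example and are not
taken from the kernel source.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	((size_t)4096)		/* assumed page size */
#define SKETCH_PAGE_MASK	(SKETCH_PAGE_SIZE - 1)

/*
 * Page-align [addr, addr + size): move addr down to a page boundary,
 * grow size by the same amount, then round size up to whole pages.
 * Returns EINVAL if either step would wrap around, 0 otherwise.
 */
static int
align_range_sketch(uintptr_t *addr, size_t *size)
{
	size_t pageoff = (size_t)(*addr & SKETCH_PAGE_MASK);
	size_t sz = *size;

	if (sz > SIZE_MAX - pageoff)
		return EINVAL;		/* size + pageoff would wrap */
	sz += pageoff;

	if (sz > SIZE_MAX - SKETCH_PAGE_MASK)
		return EINVAL;		/* rounding up would wrap to 0 */
	sz = (sz + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;

	*addr -= pageoff;
	*size = sz;
	return 0;
}
```

Keeping the arithmetic in one place means every caller (mmap, munmap, msync
and so on) rejects the same wrapping inputs in the same way.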


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred-only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.183 07-Dec-2023 deraadt

Add a stub pinsyscalls() system call that simply returns 0 for now,
before future work where ld.so(1) will need this new system call.
Putting this in the kernel ahead of time will save some grief.
ok kettenis


Revision tags: OPENBSD_7_4_BASE
# 1.182 09-May-2023 kn

Inline once-used variable to sync all uvm_map_clean() callers

OK mpi


# 1.181 11-Apr-2023 jsg

fix double words in comments
feedback and ok jmc@ miod, ok millert@


Revision tags: OPENBSD_7_3_BASE
# 1.180 08-Mar-2023 guenther

Delete obsolete /* ARGSUSED */ lint comments.

ok miod@ millert@


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static program from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@
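The slot arithmetic behind the pad argument can be sketched as follows. This is a simplified model, not kernel code: it assumes every argument occupies 32-bit slots, and the helper name arg_start_slot is hypothetical. With five word-sized arguments ahead of the off_t (as in mmap), the 64-bit value would start at slot 5 on an ABI that does not align it and at slot 6 on one that does; an explicit pad in slot 5 makes both agree.

```c
#include <assert.h>

/*
 * Hypothetical model: arguments occupy 32-bit "slots". Returns the slot
 * where an argument of `width` slots starts, given the next free slot and
 * whether the ABI aligns 64-bit values to even slots.
 */
static int
arg_start_slot(int next_free, int width, int abi_aligns_64bit)
{
	if (width == 2 && abi_aligns_64bit && (next_free % 2) != 0)
		return next_free + 1;	/* skip one slot to reach an even one */
	return next_free;
}
```

Calling arg_start_slot(5, 2, ...) models the unpadded call, and arg_start_slot(6, 2, ...) models the call after the explicit pad has consumed slot 5.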


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates them sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
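The underflow described in this commit can be illustrated with a small sketch. This is not the kernel's code: size_t stands in for vsize_t, and both helper names are hypothetical. When usage already exceeds the limit, the unsigned subtraction wraps to a huge value and the naive comparison never fires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Naive check: limit - used wraps around when used > limit. */
static bool
exceeds_limit_naive(size_t used, size_t request, size_t limit)
{
	return request > limit - used;
}

/*
 * Guarded check: reject first when usage is already over the limit, then
 * compare against the remaining headroom, which can no longer wrap.
 */
static bool
exceeds_limit_safe(size_t used, size_t request, size_t limit)
{
	return used > limit || request > limit - used;
}
```

With used = 200 and limit = 100, the naive version accepts a 10-byte request while the guarded version rejects it.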


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from the caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
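
A minimal sketch of the kind of shared wrap check described above. It is
not the macro from the tree: the name sketch_align_range, the 4096-byte
page size, and the choice of EINVAL are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uintptr_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Page-align the range [*addr, *addr + *size) in place.
 * Returns 0 on success, or EINVAL if any step would wrap around.
 * On failure the inputs may be left unmodified or partially updated.
 */
static int
sketch_align_range(uintptr_t *addr, size_t *size)
{
	size_t pageoff = (size_t)(*addr & SKETCH_PAGE_MASK);

	/* Growing size by the page offset must not overflow. */
	if (*size > SIZE_MAX - pageoff)
		return EINVAL;
	*addr -= pageoff;
	*size += pageoff;

	/* Rounding size up to a whole page must not wrap to a small value. */
	if (*size > SIZE_MAX - SKETCH_PAGE_MASK)
		return EINVAL;
	*size = (*size + SKETCH_PAGE_MASK) & ~(size_t)SKETCH_PAGE_MASK;

	/* The end of the range must not wrap past the top of the address space. */
	if (*size > UINTPTR_MAX - *addr)
		return EINVAL;
	return 0;
}
```

Keeping the checks in one helper means every caller that takes an
(addr, size) pair rejects the same inputs.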


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@
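
A minimal sketch of rejecting undefined bits instead of ignoring them.
The SK_* constants, the function name, and the use of EINVAL are
hypothetical stand-ins; real values live in <sys/mman.h> and the real
flag set is larger.

```c
#include <errno.h>

/* Hypothetical bit values for the sketch only. */
#define SK_PROT_READ	0x01
#define SK_PROT_WRITE	0x02
#define SK_PROT_EXEC	0x04
#define SK_PROT_MASK	(SK_PROT_READ | SK_PROT_WRITE | SK_PROT_EXEC)

#define SK_MAP_SHARED	0x0001
#define SK_MAP_PRIVATE	0x0002
#define SK_MAP_FIXED	0x0010
#define SK_MAP_ANON	0x1000
#define SK_MAP_MASK	(SK_MAP_SHARED | SK_MAP_PRIVATE | SK_MAP_FIXED | SK_MAP_ANON)

/* Return 0 if every bit in prot and flags is a known one, else EINVAL. */
static int
sketch_check_args(int prot, int flags)
{
	if ((prot & SK_PROT_MASK) != prot)
		return EINVAL;		/* unknown protection bit */
	if ((flags & SK_MAP_MASK) != flags)
		return EINVAL;		/* unknown mapping flag */
	return 0;
}
```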


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.182 09-May-2023 kn

Inline once-used variable to sync all uvm_map_clean() callers

OK mpi


# 1.181 11-Apr-2023 jsg

fix double words in comments
feedback and ok jmc@ miod, ok millert@


Revision tags: OPENBSD_7_3_BASE
# 1.180 08-Mar-2023 guenther

Delete obsolete /* ARGSUSED */ lint comments.

ok miod@ millert@


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@
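
The slot arithmetic in the story above can be sketched as a simplified
model. It assumes 32-bit argument slots numbered from 0; the function
names are hypothetical and no real ABI is being described.

```c
/*
 * Slot where a 64-bit argument starts when arguments are packed
 * back to back with no alignment.
 */
static int
sketch_packed_slot(int next_free_slot)
{
	return next_free_slot;
}

/*
 * Slot where a 64-bit argument starts when it must be naturally
 * aligned, i.e. begin on an even slot.
 */
static int
sketch_aligned_slot(int next_free_slot)
{
	return (next_free_slot + 1) & ~1;	/* round up to even */
}
```

For mmap(addr, len, prot, flags, fd, pos) in this model, the five 32-bit
arguments fill slots 0 to 4, so pos would start at slot 5 when packed but
at slot 6 when aligned. An explicit pad argument in slot 5 makes the next
free slot 6, where both conventions agree.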


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates them sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
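
A minimal model of the accounting rule described above: PROT_NONE
reservations are free, and the charge happens when protection flips to
something accessible. The struct, the function names, page-granular
counters, and ENOMEM as the failure code are all assumptions of the
sketch. Transitions back to PROT_NONE are not modelled.

```c
#include <errno.h>
#include <stddef.h>

#define SK_PROT_NONE	0

struct sk_space {
	size_t dused;	/* pages currently charged against the limit */
	size_t dlimit;	/* limit in pages, standing in for RLIMIT_DATA */
};

/* Charge npages, failing without side effects if the limit would be exceeded. */
static int
sk_charge(struct sk_space *vm, size_t npages)
{
	if (vm->dused > vm->dlimit || npages > vm->dlimit - vm->dused)
		return ENOMEM;
	vm->dused += npages;
	return 0;
}

/* New mapping: PROT_NONE reservations are not charged. */
static int
sk_map(struct sk_space *vm, size_t npages, int prot)
{
	if (prot == SK_PROT_NONE)
		return 0;
	return sk_charge(vm, npages);
}

/* Protection change: charge on the PROT_NONE -> accessible transition. */
static int
sk_protect(struct sk_space *vm, size_t npages, int oldprot, int newprot)
{
	if (oldprot == SK_PROT_NONE && newprot != SK_PROT_NONE)
		return sk_charge(vm, npages);
	return 0;
}
```

In this model a large PROT_NONE reservation always succeeds, and only
the pages later made accessible count toward the limit.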


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many cases of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis
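
A minimal sketch of the "always in core" answer, assuming the status
vector holds one byte per page (as noted for the mincore array in
r1.59). The function name and the value 1 for "resident" are
assumptions of the sketch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Fill the caller's status vector, one byte per page, claiming that
 * every page in the queried range is resident.
 */
static void
sketch_mincore_lie(unsigned char *vec, size_t npages)
{
	memset(vec, 1, npages);
}
```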


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
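
A minimal sketch of the policy stated above, as a pure decision
function. The name sketch_wxcheck, the SK_* bit values, and the two
boolean inputs are hypothetical. The kern.wxabort path is not modelled.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bit values for the sketch only. */
#define SK_PROT_WRITE	0x02
#define SK_PROT_EXEC	0x04

/*
 * Return 0 if the requested protection is acceptable, ENOTSUP if it is
 * a W^X violation that the binary/filesystem pair does not permit.
 */
static int
sketch_wxcheck(int prot, bool binary_wxneeded, bool fs_wxallowed)
{
	bool wx = (prot & (SK_PROT_WRITE | SK_PROT_EXEC)) ==
	    (SK_PROT_WRITE | SK_PROT_EXEC);

	if (!wx)
		return 0;	/* not writable and executable at once */
	if (binary_wxneeded && fs_wxallowed)
		return 0;	/* explicitly permitted combination */
	return ENOTSUP;
}
```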


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
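
A minimal sketch of the underflow described above, using size_t in
place of vsize_t. Both function names are hypothetical and ENOMEM is an
assumed error code. The first function shows the failure mode, the
second a check that cannot underflow.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Broken: when used > limit, the unsigned subtraction wraps to a huge
 * value and any request appears to fit.
 */
static int
sketch_limit_check_broken(size_t used, size_t request, size_t limit)
{
	if (request > limit - used)
		return ENOMEM;
	return 0;
}

/* Fixed: test used against limit before subtracting. */
static int
sketch_limit_check(size_t used, size_t request, size_t limit)
{
	if (used > limit)
		return ENOMEM;	/* already over the limit */
	if (request > limit - used)
		return ENOMEM;	/* request does not fit in what is left */
	return 0;
}
```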


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make calling pledge_fail the responsibility of the _check function. remove it
from the callers.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
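
A minimal userland sketch of the behaviour this entry describes, assuming an anonymous private mapping is an acceptable way to probe it. The function name is hypothetical, and the result depends on the system the code runs on.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANON on glibc; harmless elsewhere */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical probe: request a zero-byte anonymous mapping.
 * Returns the errno value if mmap() fails, or 0 if the call
 * unexpectedly succeeds.
 */
static int
zero_length_mmap_errno(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE,
	    -1, 0);
	if (p == MAP_FAILED)
		return errno;
	return 0;
}
```

On a system that follows the rule quoted above, the probe is expected to report EINVAL.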


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
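
The wrap-to-zero check described in this entry can be sketched in userland as follows. The 4096-byte page size, the macro names, and the helper name are assumptions for illustration; this is not the code from uvm_mmap().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)		/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: round *sizep up to a whole number of pages.
 * Returns 0 on success, or ENOMEM if rounding wrapped a non-zero
 * size around to 0. Sizes with the high bit set are accepted as long
 * as they do not wrap, since the type is unsigned.
 */
static int
round_size_checked(size_t *sizep)
{
	size_t rounded = (*sizep + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;

	if (*sizep != 0 && rounded == 0)
		return ENOMEM;
	*sizep = rounded;
	return 0;
}
```

A size of SIZE_MAX wraps during rounding and is rejected, while a page-aligned size with only the high bit set passes through unchanged.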


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.181 11-Apr-2023 jsg

fix double words in comments
feedback and ok jmc@ miod, ok millert@


Revision tags: OPENBSD_7_3_BASE
# 1.180 08-Mar-2023 guenther

Delete obsolete /* ARGSUSED */ lint comments.

ok miod@ millert@


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must not be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2).

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many
static-syscall-in-base-binary programs which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
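
The policy stated above can be summarised as a small decision function.
This is only a sketch of the rule as described in this entry, not the
kernel's uvm_wxcheck(): wx_policy_check() and its wx_permitted
parameter (standing in for "wxneeded binary on a wxallowed filesystem")
are hypothetical.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Return 0 if a mapping with protection `prot` is acceptable, or
 * ENOTSUP if it is both writable and executable and the process has
 * not been permitted to violate W^X.  Illustrative only.
 */
int
wx_policy_check(int prot, int wx_permitted)
{
	int wx = PROT_WRITE | PROT_EXEC;

	if ((prot & wx) != wx)
		return 0;		/* not a W^X violation */
	return wx_permitted ? 0 : ENOTSUP;
}
```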


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
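
The kind of underflow described above arises when "limit - used" is
computed in an unsigned type while used already exceeds limit. A
minimal sketch of a check that avoids it, using size_t in place of
vsize_t and a hypothetical function name:

```c
#include <stddef.h>

/*
 * Return 1 if `request` more bytes fit under `limit` given `used`
 * bytes already charged, 0 otherwise.  Never wraps.
 */
int
fits_under_limit(size_t limit, size_t used, size_t request)
{
	if (used > limit)	/* already over: limit - used would wrap */
		return 0;
	return request <= limit - used;
}
```

Without the first test, a process already over its limit would see a
huge "remaining" value and every request would appear to fit.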


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function; remove it
from the caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
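
A wrap check of the kind mentioned above might look like the sketch
below. It is not the macro from the tree: align_range(), the
SKETCH_PAGE_* constants, and the 4096-byte page size are all
assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uintptr_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Round [*addr, *addr + *size) outward to page boundaries.
 * Returns 0 on success, -1 if any step would wrap around.
 */
int
align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t off = *addr & SKETCH_PAGE_MASK;
	uintptr_t start = *addr - off;
	size_t len = *size;

	if (len > SIZE_MAX - off)		/* size + offset wraps */
		return -1;
	len += off;
	if (len > SIZE_MAX - SKETCH_PAGE_MASK)	/* rounding up wraps */
		return -1;
	len = (len + SKETCH_PAGE_MASK) & ~(size_t)SKETCH_PAGE_MASK;
	if (start > UINTPTR_MAX - len)		/* start + len wraps */
		return -1;
	*addr = start;
	*size = len;
	return 0;
}
```

Sharing one routine like this across mmap, munmap, msync and friends
keeps the checks identical, which is the point of the change.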


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.180 08-Mar-2023 guenther

Delete obsolete /* ARGSUSED */ lint comments.

ok miod@ millert@


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.
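
The alignment rule in the story above can be modelled roughly as follows. This is a simplified sketch, not any real ABI's calling convention: arguments are treated as numbered 32-bit slots, and sketch_align_slot is a hypothetical helper.

```c
#include <assert.h>

/*
 * Arguments are modelled as 32-bit slots numbered from 0. A 64-bit
 * argument occupies two slots and, in this model, must begin on an even
 * slot to be naturally aligned.
 */
static unsigned
sketch_align_slot(unsigned slot, unsigned width_slots)
{
	assert(width_slots == 1 || width_slots == 2);

	if (width_slots == 2 && (slot & 1) != 0)
		return slot + 1;	/* skip one slot: this is the "pad" */
	return slot;
}
```

In this model, a call shaped like lseek(fd, off, whence) has fd in slot 0, so the 64-bit off would land on odd slot 1 and is moved to slot 2. That skipped slot corresponds to the explicit pad argument the old wrappers passed.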

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parenthesis around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
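
The userland pattern this change accommodates looks roughly like the sketch below. It uses only standard mmap(2) and mprotect(2) calls; the helper names are hypothetical. Address space is reserved with PROT_NONE, and individual pages are made accessible later, which per this commit is the point where they start counting against RLIMIT_DATA.

```c
#define _DEFAULT_SOURCE 1	/* glibc: expose MAP_ANON under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve npages of address space with no access rights. */
static void *
sketch_reserve(size_t npages)
{
	size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);

	return mmap(NULL, npages * pgsz, PROT_NONE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
}

/*
 * Make one page of the reservation readable and writable.
 * Returns 0 on success, or -1 with errno set; after this commit the
 * call may fail if the change would exceed RLIMIT_DATA.
 */
static int
sketch_commit_page(void *base, size_t index)
{
	size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);

	assert(base != MAP_FAILED);
	return mprotect((char *)base + index * pgsz, pgsz,
	    PROT_READ | PROT_WRITE);
}
```

Callers following this pattern should check the mprotect(2) return value, since the limit is now enforced at that step rather than at mmap(2) time.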


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
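
The underflow can be illustrated with a small sketch. The function names are hypothetical and the exact form of the kernel check is an assumption; only the unsigned-arithmetic pitfall is taken from the message above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Naive form: limit - used wraps to a huge value when used > limit,
 * so every request appears to fit.
 */
static int
sketch_fits_limit_naive(size_t limit, size_t used, size_t want)
{
	return want <= limit - used;
}

/* Guarded form: reject first when usage already exceeds the limit. */
static int
sketch_fits_limit(size_t limit, size_t used, size_t want)
{
	if (used > limit)
		return 0;
	return want <= limit - used;
}

/* Usage: the two forms disagree only once usage exceeds the limit. */
static void
sketch_demo(void)
{
	assert(sketch_fits_limit_naive(100, 150, 1) == 1);	/* wrongly admitted */
	assert(sketch_fits_limit(100, 150, 1) == 0);		/* rejected */
}
```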


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano
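
The policy described above can be sketched as a simple predicate. This is not the kernel's uvm_wxcheck(); the function name and the wxallowed parameter are hypothetical stand-ins for "the binary lives on a filesystem mounted with wxallowed".

```c
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

/*
 * Return ENOTSUP for a request that is both writable and executable,
 * unless the (hypothetical) wxallowed flag permits it. Otherwise 0.
 */
static int
sketch_wx_check(int prot, int wxallowed)
{
	int wx = PROT_WRITE | PROT_EXEC;

	if ((prot & wx) == wx && !wxallowed)
		return ENOTSUP;
	return 0;
}

/* Usage: W or X alone is fine; W and X together needs wxallowed. */
static void
sketch_wx_demo(void)
{
	assert(sketch_wx_check(PROT_READ | PROT_WRITE, 0) == 0);
	assert(sketch_wx_check(PROT_READ | PROT_EXEC, 0) == 0);
	assert(sketch_wx_check(PROT_WRITE | PROT_EXEC, 0) == ENOTSUP);
	assert(sketch_wx_check(PROT_WRITE | PROT_EXEC, 1) == 0);
}
```

The sketch leaves out the kernel log message and the kern.wxabort SIGABRT path mentioned in the message.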


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- makes pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
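
As an illustration of the kind of check this commit centralizes, here is a minimal sketch of page-aligning an (addr, size) range while rejecting integer wraparound. It is not the kernel's actual macro: the name sketch_align_range(), the fixed 4096-byte page size, and the -1 error return are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for the sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE ((uintptr_t)4096)
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: page-align the range [*addr, *addr + *size)
 * in place. Returns 0 on success, or -1 (leaving the arguments
 * untouched) if any step would wrap.
 */
int
sketch_align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t pageoff = *addr & SKETCH_PAGE_MASK;
	size_t sz = *size;

	/* Growing the size by the page offset must not wrap. */
	if (sz > SIZE_MAX - pageoff)
		return -1;
	sz += pageoff;

	/* Rounding the size up to a page multiple must not wrap. */
	if (sz > SIZE_MAX - SKETCH_PAGE_MASK)
		return -1;
	sz = (sz + SKETCH_PAGE_MASK) & ~(size_t)SKETCH_PAGE_MASK;

	/* The end address must not wrap past the top of the address space. */
	if (sz > UINTPTR_MAX - (*addr - pageoff))
		return -1;

	*addr -= pageoff;
	*size = sz;
	return 0;
}
```

Keeping the checks in one place means mmap, munmap, msync and friends cannot drift apart in how they treat a wrapping range.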


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
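
A rough sketch of the check described here, under stated assumptions: the helper name sketch_check_mmap_size() and the 4096-byte page size are hypothetical, and the real uvm_mmap() code is not reproduced. Because the size is unsigned, a value with the high bit set is a legitimate (if large) request; only a size that rounds up to 0 is rejected.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for the sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE ((size_t)4096)

/*
 * Hypothetical helper: round size up to a page multiple.
 * Returns ENOMEM if a nonzero size wraps to 0 when rounded,
 * otherwise stores the rounded size and returns 0.
 */
int
sketch_check_mmap_size(size_t size, size_t *rounded)
{
	/* Unsigned arithmetic wraps modulo SIZE_MAX + 1 by definition. */
	size_t r = (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);

	if (size != 0 && r == 0)
		return ENOMEM;	/* rounding wrapped to 0 */

	*rounded = r;
	return 0;
}
```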


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.179 16-Feb-2023 deraadt

Add pinsyscall(2). With this you can tell the kernel the location
(start,len) of the syscall stub in libc.so for a specified syscall
(using SYS_* notation). Only SYS_execve is supported at this time.
ok gnezdo mortimer kettenis


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@
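
The slot arithmetic behind the pad argument can be sketched as follows. This is an illustration only, assuming an ILP32 ABI where each preceding argument occupies one 32-bit slot; the function name sketch_off_t_slot() is made up and nothing here is taken from the kernel or libc sources.

```c
/*
 * Hypothetical helper: given the number of 32-bit argument slots that
 * precede an off_t argument, return the slot index at which the off_t
 * starts once it is forced onto an even slot. An odd count gains one
 * slot of padding; an even count is already aligned.
 */
int
sketch_off_t_slot(int preceding_slots)
{
	return preceding_slots + (preceding_slots & 1);
}
```

For example, pread(fd, buf, nbyte, offset) has three arguments ahead of the offset, so the offset would naturally start at slot 3; the explicit pad moved it to slot 4, where its position matches that of a naturally aligned 64-bit structure member.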


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parenthesis around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many
static-syscall-in-base-binary programs which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which means it is even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
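
The class of bug described here can be illustrated with unsigned
arithmetic alone. The sketch below does not reproduce the kernel code;
the vsize_t typedef is a stand-in and both function names are
hypothetical.

```c
#include <stdint.h>

typedef uintptr_t vsize_t;	/* stand-in for the kernel type */

/*
 * Buggy form: when used > limit, "limit - used" wraps around to a
 * huge value and the request is wrongly accepted.
 */
int
within_limit_naive(vsize_t used, vsize_t grow, vsize_t limit)
{
	return grow <= limit - used;
}

/* Safe form: rule out the underflow before subtracting. */
int
within_limit(vsize_t used, vsize_t grow, vsize_t limit)
{
	if (used > limit)
		return 0;
	return grow <= limit - used;
}
```

With used = 30 and limit = 20, the naive check accepts a one-byte
request while the safe check rejects it.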


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove
it from the caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
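
The kind of check being factored out can be sketched as a plain
function. This is an illustration under stated assumptions, not the
macro from the commit: the name align_range(), the fixed 4096-byte
page size (EX_PAGE_SIZE), and the choice of EINVAL as the error are
all made up for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SIZE	4096	/* assumed page size for the example */
#define EX_PAGE_MASK	((uintptr_t)EX_PAGE_SIZE - 1)

/*
 * Page-align a (addr, size) request: move addr down to a page
 * boundary, grow size by the same amount, then round size up to a
 * whole number of pages.  Returns 0 on success or EINVAL if either
 * step would wrap around.  (hypothetical helper)
 */
int
align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t pageoff = *addr & EX_PAGE_MASK;
	size_t sz = *size;

	if (pageoff != 0) {
		if (sz > SIZE_MAX - pageoff)
			return EINVAL;	/* size + pageoff would wrap */
		sz += pageoff;
	}
	if (sz != 0) {
		if (sz > SIZE_MAX - EX_PAGE_MASK)
			return EINVAL;	/* rounding up would wrap to 0 */
		sz = (sz + EX_PAGE_MASK) & ~(size_t)EX_PAGE_MASK;
	}
	*addr -= pageoff;
	*size = sz;
	return 0;
}
```

Sharing one routine like this between mmap, munmap, msync and friends
keeps the wraparound handling identical everywhere. The sketch does
not check whether addr + size itself overflows the address space.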


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.178 11-Feb-2023 deraadt

non-padded 64-bit system calls arrived 2021/12/23, over a year ago.
time to delete the backwards compat padded functions in the kernel.


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since someone can read the libraries.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which means it is even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
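
The class of bug described here can be shown with a small, self-contained sketch. It uses size_t as a stand-in for vsize_t, and the function names are hypothetical; this is not the code from the commit, only an illustration of why an unsigned "limit - used" comparison stops working once usage already exceeds the limit.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: with unsigned arithmetic, limit - used wraps around to
 * a huge value when used > limit, so almost any request "fits".
 */
bool
fits_naive(size_t limit, size_t used, size_t request)
{
	return request <= limit - used;
}

/*
 * Guarded check: reject outright when usage already exceeds the
 * limit, and only then compare against the remaining headroom.
 */
bool
fits_checked(size_t limit, size_t used, size_t request)
{
	if (used > limit)
		return false;
	return request <= limit - used;
}
```

With limit 100 and used 200, fits_naive() accepts a request of 50 while fits_checked() refuses it; for usage below the limit the two functions agree.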


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
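
The wrap check described in r1.55 can be illustrated with a small model. This is
only a sketch in Python, not the kernel code: the 4096-byte page size, the 64-bit
size_t, and the helper names round_page and check_mmap_size are assumptions made
for the illustration.

```python
import errno

PAGE_SIZE = 4096             # assumed page size for the illustration
PAGE_MASK = PAGE_SIZE - 1
SIZE_MAX = (1 << 64) - 1     # models a 64-bit unsigned size_t


def round_page(size):
    """Round size up to a page boundary, wrapping like 64-bit unsigned math."""
    return ((size + PAGE_MASK) & ~PAGE_MASK) & SIZE_MAX


def check_mmap_size(size):
    """Return 0 if the page-rounded size is usable, else an errno value.

    Hypothetical helper: it rejects only sizes whose rounding wraps to 0,
    and deliberately accepts sizes that merely have the high bit set.
    """
    if size != 0 and round_page(size) == 0:
        return errno.ENOMEM   # rounding wrapped past SIZE_MAX to 0
    return 0
```

In this model check_mmap_size(SIZE_MAX) reports ENOMEM because the rounded size
wraps to 0, while check_mmap_size(1 << 63) is accepted since the value is
unsigned and does not wrap.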


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.177 16-Jan-2023 guenther

Currently we disable kbind(2) for static programs from libc.a's
preinit hook. Delete that and instead have the kernel disable kbind
at exec-time if the program doesn't have an ELF interpreter. For
now, permit userland calls to disable it when already disabled so
existing static programs continue to work.

prompted by deraadt@ questioning about the call in libc.a

ok deraadt@ miod@


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
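
The accounting change in r1.161 can be pictured with a toy model. This is a
sketch in Python under stated assumptions, not the uvm implementation: the
DataAccounting class, its page-count bookkeeping, and the uncharging of pages
that flip back to PROT_NONE are hypothetical and exist only to show when the
limit is checked.

```python
PROT_NONE, PROT_READ, PROT_WRITE = 0, 1, 2   # values as in <sys/mman.h>


class DataAccounting:
    """Toy model: pages count against the limit only while accessible."""

    def __init__(self, limit_pages):
        self.limit = limit_pages
        self.used = 0
        self.regions = {}            # name -> (pages, prot)

    def mmap(self, name, pages, prot):
        # PROT_NONE reservations are not charged and never hit the limit.
        if prot != PROT_NONE:
            if self.used + pages > self.limit:
                return False
            self.used += pages
        self.regions[name] = (pages, prot)
        return True

    def mprotect(self, name, prot):
        pages, old = self.regions[name]
        if old == PROT_NONE and prot != PROT_NONE:
            # The charge (and the limit check) happens at the flip.
            if self.used + pages > self.limit:
                return False
            self.used += pages
        elif old != PROT_NONE and prot == PROT_NONE:
            # Assumption of this model: flipping back releases the charge.
            self.used -= pages
        self.regions[name] = (pages, prot)
        return True
```

With a limit of 10 pages, a 1000-page PROT_NONE reservation succeeds and leaves
used at 0, while a later mprotect of that whole region to PROT_READ fails, which
mirrors the note that mprotect(2) may now fail on RLIMIT_DATA.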


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
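
The underflow described above is easy to reproduce with plain unsigned arithmetic. The sketch below is not the kernel code; the function names are hypothetical and size_t stands in for vsize_t. It contrasts a naive check with one that handles the "already over the limit" case first.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: when used > limit, "limit - used" wraps around to a
 * huge unsigned value and almost any request appears to fit.
 */
bool
fits_naive(size_t used, size_t size, size_t limit)
{
	return size <= limit - used;
}

/* Guarded check: reject first if the limit is already exceeded. */
bool
fits_checked(size_t used, size_t size, size_t limit)
{
	if (used > limit)
		return false;
	return size <= limit - used;
}
```

With used = 30 and limit = 20, fits_naive() accepts a request of size 1 while fits_checked() rejects it.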


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in lesse kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
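
The "size wrap to 0" case comes from rounding the requested length up to a page boundary. A simplified sketch of that check follows; the page size constant and the function name are assumptions for illustration, and this is not the code from uvm_mmap().

```c
#include <errno.h>
#include <stddef.h>

/* Assumed page size for the example only. */
#define EX_PAGE_SIZE	((size_t)4096)

/*
 * Round len up to a multiple of the page size. If the addition wraps
 * so that a non-zero request rounds to 0, report ENOMEM instead of
 * treating it as an empty mapping. Sizes with the high bit set are
 * otherwise accepted, since the type is unsigned.
 */
int
round_len(size_t len, size_t *out)
{
	size_t r;

	r = (len + EX_PAGE_SIZE - 1) & ~(EX_PAGE_SIZE - 1);
	if (len != 0 && r == 0)
		return ENOMEM;
	*out = r;
	return 0;
}
```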


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls user suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.
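
The truncation hazard behind this change can be sketched as follows. This is
only an illustration: the SK_ and sk_ names, the 4 KB page size and the 32-bit
vaddr_t are assumptions, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration: 4 KB pages and a 32-bit vaddr_t. */
#define SK_PAGE_SIZE	(1 << 12)
#define SK_PAGE_MASK	(SK_PAGE_SIZE - 1)
typedef uint32_t sk_vaddr_t;

/* Old style: the cast inside the macro truncates wider operands. */
#define OLD_TRUNC_PAGE(x)	((sk_vaddr_t)(x) & ~SK_PAGE_MASK)
#define OLD_ROUND_PAGE(x)	(((sk_vaddr_t)(x) + SK_PAGE_MASK) & ~SK_PAGE_MASK)

/* New style: no cast, so the result keeps the operand's width. */
#define NEW_TRUNC_PAGE(x)	((x) & ~SK_PAGE_MASK)
#define NEW_ROUND_PAGE(x)	(((x) + SK_PAGE_MASK) & ~SK_PAGE_MASK)

/* Returns 1 if the two styles disagree for a 64-bit offset. */
static int
page_macros_differ(uint64_t off)
{
	uint64_t old_val = OLD_TRUNC_PAGE(off);
	uint64_t new_val = NEW_TRUNC_PAGE(off);

	/* The uncast form must never lose the high bits. */
	assert((new_val >> 32) == (off >> 32));
	return old_val != new_val;
}
```

Without the cast the macros no longer accept raw pointers, which is presumably
why callers passing pointers had to add their own casts.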


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.176 04-Jan-2023 jsg

Chuck Cranor rescinded the advertising clause of uvm_mmap.c in
NetBSD rev 1.134 and confirmed with Mike Hibler that the University of
Utah would do the same.

https://mail-index.netbsd.org/source-changes/2011/02/02/msg018021.html

ok deraadt@


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis
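
A toy model of the behaviour described above, for illustration only. The
sk_entry struct and the sk_ functions are hypothetical and do not mirror the
kernel's map entries; the only detail taken from the log is that a later
permission change fails with EPERM.

```c
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

/* Hypothetical stand-in for a map entry; not the kernel's struct. */
struct sk_entry {
	int prot;	/* current PROT_* bits */
	int immutable;	/* set once, never cleared */
};

/* Model of mimmutable(2): lock the permissions of the entry. */
static void
sk_mimmutable(struct sk_entry *e)
{
	e->immutable = 1;
}

/* Model of a later mprotect(2) on that entry. */
static int
sk_mprotect(struct sk_entry *e, int newprot)
{
	assert((newprot & ~(PROT_READ | PROT_WRITE | PROT_EXEC)) == 0);
	if (e->immutable)
		return EPERM;	/* permissions are locked */
	e->prot = newprot;
	return 0;
}
```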


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@
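
The slot arithmetic in the story above can be modelled with a small sketch.
It assumes 32-bit argument slots; off_t_slot() is a hypothetical helper, not
a kernel or libc function.

```c
#include <assert.h>

/*
 * Index of the 32-bit argument slot where a 64-bit value starts when it
 * follows nargs32 32-bit arguments.  aligned != 0 models an ABI that
 * starts 64-bit values on an even slot; aligned == 0 models the early
 * gcc behaviour of using the next free slot.
 */
static int
off_t_slot(int nargs32, int aligned)
{
	assert(nargs32 >= 0);
	if (aligned && (nargs32 & 1))
		return nargs32 + 1;	/* skip one slot to reach an even one */
	return nargs32;
}
```

For a call shaped like ftruncate(fd, length), one argument precedes the off_t,
so the two conventions disagree (slot 1 versus slot 2). With an explicit pad,
as in (fd, pad, length), two arguments precede it and both conventions agree
on slot 2, which is the effect the pad argument was added to achieve.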


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parenthesis around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
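
A minimal model of the accounting rule described above. The sk_ names are
hypothetical, and the ENOMEM return value is an assumption; the log only says
that mprotect(2) may now fail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical per-process accounting state. */
struct sk_acct {
	size_t used;	/* bytes currently charged */
	size_t limit;	/* stand-in for RLIMIT_DATA */
};

/* Charge len bytes, but only when the mapping permits access. */
static int
sk_acct_map(struct sk_acct *a, size_t len, int prot)
{
	assert(a != NULL);
	if (prot == PROT_NONE)
		return 0;		/* reserved only, not charged */
	if (len > a->limit || a->used > a->limit - len)
		return ENOMEM;		/* assumed error code */
	a->used += len;
	return 0;
}

/* Charge on the PROT_NONE -> accessible transition; other changes are free. */
static int
sk_acct_protect(struct sk_acct *a, size_t len, int oldprot, int newprot)
{
	if (oldprot == PROT_NONE && newprot != PROT_NONE)
		return sk_acct_map(a, len, newprot);
	return 0;
}
```

The sketch deliberately leaves out the reverse transition, since the log entry
does not describe it.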


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can then be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen
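
The policy of "system calls must be in pre-registered regions" can be pictured
with a small sketch. The sk_region type and sk_pc_registered() are
hypothetical; the kernel's actual data structures and lookup are not shown in
this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registered region, half-open: [start, end). */
struct sk_region {
	uintptr_t start;
	uintptr_t end;
};

/* Returns 1 if pc lies inside one of the n registered regions. */
static int
sk_pc_registered(const struct sk_region *r, size_t n, uintptr_t pc)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(r[i].start <= r[i].end);
		if (pc >= r[i].start && pc < r[i].end)
			return 1;
	}
	return 0;
}
```

In this model the registered set would hold entries such as the main program's
exec section, the sigtramp page, ld.so's exec segment and libc's exec section,
as listed in the message above.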


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis
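
What "always return that memory is in core" amounts to can be sketched as
below. sk_mincore_lie() is a hypothetical helper that fills the result vector,
one byte per page; it is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Mark every page covering len bytes as resident, regardless of its real
 * state.  Returns the number of vector entries written.
 */
static size_t
sk_mincore_lie(unsigned char *vec, size_t len, size_t pagesz)
{
	size_t npgs;

	assert(pagesz != 0);
	npgs = len / pagesz + (len % pagesz != 0);	/* round up to pages */
	memset(vec, 1, npgs);
	return npgs;
}
```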


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis
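
The semantics described above can be modelled roughly as follows. This is a
sketch under stated assumptions: sk_wxcheck() and its parameters are
hypothetical, and would_abort merely stands in for the process being aborted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * prot:         requested PROT_* bits
 * wx_permitted: nonzero if the binary is allowed to violate W^X
 * wxabort:      models the kern.wxabort sysctl
 * would_abort:  set to 1 where the real kernel would abort the process
 */
static int
sk_wxcheck(int prot, int wx_permitted, int wxabort, int *would_abort)
{
	assert(would_abort != NULL);
	*would_abort = 0;

	/* Only a mapping that is both writable and executable violates W^X. */
	if ((prot & (PROT_WRITE | PROT_EXEC)) != (PROT_WRITE | PROT_EXEC))
		return 0;
	if (wx_permitted)
		return 0;
	if (wxabort)
		*would_abort = 1;
	return ENOTSUP;
}
```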


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
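
The class of bug described above can be illustrated with unsigned arithmetic.
This is a sketch, not the kernel's code: size_t stands in for vsize_t and the
sk_ function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Broken: limit - used wraps around when used already exceeds limit. */
static int
sk_fits_broken(size_t used, size_t limit, size_t size)
{
	return size <= limit - used;
}

/* Fixed: reject first when the process is already over its limit. */
static int
sk_fits_fixed(size_t used, size_t limit, size_t size)
{
	if (used > limit)
		return 0;
	assert(limit - used <= limit);	/* the subtraction cannot wrap here */
	return size <= limit - used;
}
```

With used = 200 and limit = 100, the broken form computes a huge unsigned
difference and accepts any request, which matches the "ineffective when you
already had more memory than your limit allowed" symptom.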


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make calling pledge_fail the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
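
The wrap-to-zero check described above can be sketched in userland C as follows. This is only an illustration, not the uvm_mmap() code: the names sketch_round_page and sketch_check_mmap_size and the 4096-byte page size are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch; a kernel would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE ((size_t)4096)

/* Round size up to a page multiple; the unsigned addition may wrap. */
size_t
sketch_round_page(size_t size)
{
	/* The mask trick below only works for a power-of-two page size. */
	assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0);
	return (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Return 0 if the rounded size is usable, or ENOMEM if rounding wrapped
 * a nonzero request around to 0. Sizes with the high bit set are not
 * rejected, because the type is unsigned.
 */
int
sketch_check_mmap_size(size_t size)
{
	if (size != 0 && sketch_round_page(size) == 0)
		return ENOMEM;
	return 0;
}
```

With these hypothetical helpers, a request of SIZE_MAX bytes is refused with ENOMEM, while a request of SIZE_MAX / 2 + 1 bytes passes this particular check.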


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).
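
From userland, the advice values listed above are passed to madvise(2) on an already mapped range. The following is a minimal sketch, assuming a POSIX-like system with MAP_ANON; sketch_advise is a hypothetical helper name. It does not inspect the page contents afterwards, since what a later read returns depends on the advice and on the platform.

```c
#define _DEFAULT_SOURCE	/* glibc: expose MAP_ANON and madvise() */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory, dirty them, and apply the given
 * advice (for example MADV_DONTNEED or MADV_FREE) to the whole range.
 * Returns 0 on success or an errno value on failure.
 */
int
sketch_advise(size_t len, int advice)
{
	int error = 0;
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return errno;

	memset(p, 0xa5, len);		/* fault the pages in */
	if (madvise(p, len, advice) == -1)
		error = errno;

	if (munmap(p, len) == -1 && error == 0)
		error = errno;
	return error;
}
```

A caller might use sketch_advise(1 << 16, MADV_DONTNEED) to exercise the first case; MADV_FREE can be passed the same way on systems that define it.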


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.175 17-Nov-2022 deraadt

stack growth from setrlimit was never updated to set UVM_ET_STACK on
the entries, so the check-sp-at-system-call check failed. Quite strange
it took this long to find this.
ok kettenis


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
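
The reserve-then-populate pattern this change is meant to help can be sketched from userland as below. It is only an illustration under assumptions: sketch_reserve_then_commit is a hypothetical helper, and the sketch simply reports whatever errno mprotect(2) returns, which under the accounting described above may be a failure once the newly accessible pages would exceed RLIMIT_DATA.

```c
#define _DEFAULT_SOURCE	/* glibc: expose MAP_ANON */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Reserve reserve_len bytes of address space with PROT_NONE, then make
 * the first commit_len bytes accessible and touch them.
 * Returns 0 on success or an errno value on failure.
 */
int
sketch_reserve_then_commit(size_t reserve_len, size_t commit_len)
{
	int error = 0;
	char *base;

	assert(commit_len > 0 && commit_len <= reserve_len);

	/* Reservation only: nothing in this range can be accessed yet. */
	base = mmap(NULL, reserve_len, PROT_NONE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (base == MAP_FAILED)
		return errno;

	/* Flip part of the range to read/write; this step may fail. */
	if (mprotect(base, commit_len, PROT_READ | PROT_WRITE) == -1)
		error = errno;
	else
		memset(base, 0, commit_len);

	if (munmap(base, reserve_len) == -1 && error == 0)
		error = errno;
	return error;
}
```

The point for callers is that the mprotect(2) return value has to be checked; with this accounting, the earlier successful reservation no longer implies that access can be granted later.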


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
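
A program that allocates its own stack memory would request the flag at mmap(2) time, since the text above notes there is no way to add it later. The following is a minimal sketch with hypothetical helper names; the fallback definition of MAP_STACK as 0 is an assumption for platforms that lack the flag and is not part of the change described here.

```c
#define _DEFAULT_SOURCE	/* glibc: expose MAP_ANON and MAP_STACK */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_STACK
#define MAP_STACK 0	/* assumption: platform has no such flag */
#endif

/* Allocate len bytes suitable for use as a stack, or return NULL. */
void *
sketch_stack_alloc(size_t len)
{
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_STACK, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/* Release a region obtained from sketch_stack_alloc(). */
int
sketch_stack_free(void *p, size_t len)
{
	return munmap(p, len) == -1 ? errno : 0;
}

/* Allocate and release once; returns 0 on success or an errno value. */
int
sketch_stack_roundtrip(size_t len)
{
	void *p = sketch_stack_alloc(len);

	if (p == NULL)
		return ENOMEM;
	return sketch_stack_free(p, len);
}
```

Memory obtained this way could then be handed to an interface that takes a caller-supplied stack; the commit above says sigaltstack() and pthread_attr_setstack() set up the MAP_STACK sub-region themselves.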


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
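
A program that generates code at run time can stay within this policy by never holding a mapping that is writable and executable at once. The following is a minimal sketch of that write-then-switch pattern; sketch_wx_swap is a hypothetical helper and the copied bytes are never executed here. Requesting PROT_WRITE | PROT_EXEC together is the case the text above says is answered with ENOTSUP.

```c
#define _DEFAULT_SOURCE	/* glibc: expose MAP_ANON */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Copy code_len bytes into a fresh read/write mapping of map_len bytes,
 * then switch the mapping to read/execute. The mapping is never both
 * writable and executable. Returns 0 on success or an errno value.
 */
int
sketch_wx_swap(const unsigned char *code, size_t code_len, size_t map_len)
{
	int error = 0;
	void *p;

	assert(code != NULL && map_len > 0 && code_len <= map_len);

	p = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return errno;

	memcpy(p, code, code_len);	/* write phase: RW, not X */

	/* execute phase: RX, not W */
	if (mprotect(p, map_len, PROT_READ | PROT_EXEC) == -1)
		error = errno;

	if (munmap(p, map_len) == -1 && error == 0)
		error = errno;
	return error;
}
```

To modify the code later, such a program would switch the range back to PROT_READ | PROT_WRITE first, so the two permissions still never overlap.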


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
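
The underflow described above can be reproduced with plain unsigned arithmetic. The following is an illustrative sketch only, using size_t in place of vsize_t and hypothetical function names; it is not the code from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: when used > limit, the subtraction wraps around to a
 * huge value, so the request is wrongly reported as within the limit.
 */
bool
sketch_exceeds_limit_naive(size_t limit, size_t used, size_t size)
{
	return size > limit - used;
}

/* Underflow-safe check: handle the "already over the limit" case first. */
bool
sketch_exceeds_limit_safe(size_t limit, size_t used, size_t size)
{
	if (used > limit)
		return true;
	assert(used <= limit);		/* limit - used cannot wrap here */
	return size > limit - used;
}
```

With limit = 100 and used = 200, the naive version accepts a request of 10 while the safe version rejects it.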


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano
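
As a rough sketch of what counts as a W^X violation here: a request is a violation when it asks for PROT_WRITE and PROT_EXEC at the same time. The helper name `prot_violates_wx` is hypothetical, and this is not the kernel's actual check; it leaves out the wxallowed mount test, the log message and the kern.wxabort handling.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* True when a protection request asks for write and execute together. */
static bool
prot_violates_wx(int prot)
{
	return (prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC);
}
```

A caller modelled on the description above would return ENOTSUP from mmap/mprotect when this predicate is true and the binary does not come from a wxallowed filesystem.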


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
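
The wrap-to-zero case can be sketched as follows. This is an illustration under stated assumptions rather than the code from uvm_mmap(): the 4096-byte page size, the `SKETCH_*` macros and the function name `round_size_checked` are all hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Round *size up to a whole number of pages.  Returns 0 on success, or
 * ENOMEM when rounding wraps a nonzero size around to 0.  Sizes with the
 * high bit set are accepted as long as the rounding does not wrap.
 */
static int
round_size_checked(size_t *size)
{
	size_t rounded = (*size + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;

	if (*size != 0 && rounded == 0)
		return ENOMEM;
	*size = rounded;
	return 0;
}
```

The point of the design is that the only rejected inputs are those whose rounded size cannot be represented, not every size that merely looks large.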


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.174 21-Oct-2022 deraadt

the debug "name" parameter to uvm_map_immutable() is no longer needed


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
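
The reserve-then-populate pattern this change favours might look like the
following from userland. A minimal sketch using only standard mmap(2) and
mprotect(2); reserve_then_commit() is a hypothetical helper name, and the
comment about where the limit is charged restates the log message above
rather than anything the code itself checks.

```c
#define _DEFAULT_SOURCE	/* exposes MAP_ANON on glibc in strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve `reserve` bytes of address space with no access, then make
 * only the first `commit` bytes readable and writable.
 * Returns the base address, or NULL on failure.
 */
void *
reserve_then_commit(size_t reserve, size_t commit)
{
	void *base;

	base = mmap(NULL, reserve, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/*
	 * Per the log message, this is where the pages start counting
	 * against RLIMIT_DATA, so this call may fail.
	 */
	if (mprotect(base, commit, PROT_READ | PROT_WRITE) == -1) {
		munmap(base, reserve);
		return NULL;
	}
	return base;
}
```

Callers that reserve large regions should therefore check the return value
of every mprotect(2) call, not only the initial mmap(2).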


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can then be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which means it is even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
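
The underflow described here can be shown with plain unsigned arithmetic.
A sketch only: over_limit_buggy() and over_limit_fixed() are hypothetical
names, size_t stands in for vsize_t, and this is not the code that was
committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Buggy form: when used > limit, `limit - used` wraps around to a huge
 * unsigned value, so the request never appears to exceed the limit.
 */
bool
over_limit_buggy(size_t used, size_t request, size_t limit)
{
	return request > limit - used;
}

/* Fixed form: rule out used > limit before subtracting. */
bool
over_limit_fixed(size_t used, size_t request, size_t limit)
{
	return used > limit || request > limit - used;
}
```

A process already at 200 units with a limit of 100 passes the buggy check
for any small request, which is the "ineffective" behaviour the message
describes.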


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
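
The resulting behaviour can be observed from userland with a zero-length
request. A minimal sketch; zero_length_mmap_errno() is a hypothetical
helper name.

```c
#define _DEFAULT_SOURCE	/* exposes MAP_ANON on glibc in strict -std modes */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Attempt a 0-byte anonymous mapping and report the errno it produced.
 * Returns 0 if the mapping unexpectedly succeeded.
 */
int
zero_length_mmap_errno(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p != MAP_FAILED)
		return 0;
	return errno;
}
```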


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
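
The wrap-to-zero condition can be sketched as a page-rounding helper.
This is an illustration, not the kernel's code: round_size() is a
hypothetical name, and pagesz is assumed to be a power of two.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Round size up to a multiple of pagesz (assumed to be a power of two).
 * Returns 0 and stores the result in *out, or ENOMEM if the rounding
 * wrapped around to 0. Sizes with the high bit set are accepted as long
 * as they do not wrap.
 */
int
round_size(size_t size, size_t pagesz, size_t *out)
{
	size_t rounded = (size + pagesz - 1) & ~(pagesz - 1);

	if (size != 0 && rounded == 0)
		return ENOMEM;
	*out = rounded;
	return 0;
}
```

Only sizes within one page of SIZE_MAX wrap, so the check rejects exactly
those and nothing else.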


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.173 07-Oct-2022 deraadt

Add mimmutable(2) system call which locks the permissions (PROT_*) of
memory mappings so they cannot be changed by a later mmap(), mprotect(),
or munmap(), which will error with EPERM instead.
ok kettenis


Revision tags: OPENBSD_7_2_BASE
# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@
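
The slot arithmetic behind the pad argument can be sketched as follows. This assumes an ILP32 calling convention in which every pointer, int and size_t argument takes one 32-bit slot and a 64-bit off_t must begin on an even slot; both function names are hypothetical.

```c
#include <assert.h>

/*
 * Given the number of 32-bit argument words that precede a 64-bit
 * argument, return the slot index where that argument must start
 * when 64-bit values are only allowed on even slots.
 */
unsigned int
sketch_first_even_slot(unsigned int nwords)
{
	unsigned int slot = (nwords + 1u) & ~1u;

	assert((slot & 1u) == 0);
	return slot;
}

/* Nonzero when an explicit pad word is needed to reach an even slot. */
int
sketch_needs_pad(unsigned int nwords)
{
	return (nwords & 1u) != 0;
}
```

Under these assumptions mmap(addr, len, prot, flags, fd, off) has five words before off, so off lands on slot 6 with slot 5 left as padding.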


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
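
A minimal sketch of the accounting rule described above, counting in pages. struct sk_vmspace, its fields and sk_protect_account() are hypothetical, ENOMEM is an assumed error value, and only the PROT_NONE to accessible direction named in this entry is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SK_PROT_NONE	0x00
#define SK_PROT_READ	0x01
#define SK_PROT_WRITE	0x02

/* Hypothetical per-process accounting state, in pages. */
struct sk_vmspace {
	size_t dused;	/* pages currently counted against the limit */
	size_t dlimit;	/* stand-in for RLIMIT_DATA, in pages */
};

/*
 * Charge npages against the limit only when a range moves from
 * PROT_NONE to a protection that permits access.
 * Returns 0 on success, ENOMEM if the limit would be exceeded.
 */
int
sk_protect_account(struct sk_vmspace *vs, size_t npages, int oldprot,
    int newprot)
{
	assert(vs != NULL);
	if (oldprot == SK_PROT_NONE && newprot != SK_PROT_NONE) {
		if (vs->dused > vs->dlimit ||
		    npages > vs->dlimit - vs->dused)
			return ENOMEM;
		vs->dused += npages;
	}
	return 0;
}
```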


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
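
The stack-register test described above can be sketched as a lookup over mapped regions. struct sk_region and sk_sp_in_stack() are hypothetical; the real check runs against the process map, and a caller would deliver SIGSEGV when this returns 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one mapping: [start, end). */
struct sk_region {
	uintptr_t start;
	uintptr_t end;
	int is_stack;	/* nonzero if mapped with the stack flag */
};

/*
 * Return nonzero only if sp falls inside a region that was mapped
 * as stack memory; unmapped addresses and ordinary regions fail.
 */
int
sk_sp_in_stack(const struct sk_region *regions, size_t n, uintptr_t sp)
{
	size_t i;

	assert(regions != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (sp >= regions[i].start && sp < regions[i].end)
			return regions[i].is_stack;
	}
	return 0;
}
```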


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
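
The policy stated in this entry can be sketched as below. sk_wxcheck() and the SK_* constants are hypothetical, the two int parameters stand in for the binary's "wxneeded" marking and the filesystem's "wxallowed" option, and the kern.wxabort path is not modelled.

```c
#include <assert.h>
#include <errno.h>

#define SK_PROT_READ	0x01
#define SK_PROT_WRITE	0x02
#define SK_PROT_EXEC	0x04
#define SK_PROT_MASK	(SK_PROT_READ | SK_PROT_WRITE | SK_PROT_EXEC)

/*
 * Return 0 if the requested protection is acceptable, ENOTSUP if it
 * asks for write and execute together without the exemption.
 * The caller is assumed to have rejected unknown prot bits already.
 */
int
sk_wxcheck(int prot, int wxneeded, int wxallowed)
{
	const int wx = SK_PROT_WRITE | SK_PROT_EXEC;

	assert((prot & ~SK_PROT_MASK) == 0);
	if ((prot & wx) != wx)
		return 0;		/* not a W^X violation */
	if (wxneeded && wxallowed)
		return 0;		/* marked binary on permitted fs */
	return ENOTSUP;
}
```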


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
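
The underflow can be reproduced with plain unsigned arithmetic. This is a sketch using size_t in place of vsize_t; both function names are hypothetical and the code is not taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Unguarded form: when used > limit, limit - used wraps around to a
 * huge unsigned value, so the request is never seen as over limit.
 */
int
sk_over_limit_unguarded(size_t limit, size_t used, size_t size)
{
	return size > limit - used;
}

/* Guarded form: test used against limit before subtracting. */
int
sk_over_limit(size_t limit, size_t used, size_t size)
{
	if (used > limit)
		return 1;
	assert(limit - used <= limit);	/* no wrap: used <= limit here */
	return size > limit - used;
}
```

With limit 100 and used 150 the unguarded form lets a 10-unit request through, while the guarded form refuses it.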


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
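
The behaviour described above can be observed from userland. The following
is a minimal sketch, not code from this commit; the helper name
zero_length_mmap_errno() is made up for illustration.

```c
/* Needed on glibc for MAP_ANON in strict modes; harmless elsewhere. */
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: attempt a zero-length anonymous mapping and
 * return the resulting errno, or 0 if the call unexpectedly succeeded.
 */
int
zero_length_mmap_errno(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p == MAP_FAILED)
		return errno;
	return 0;
}
```

On a system that follows the rule in this commit, the helper is expected to
return EINVAL.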


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
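
A shared wrap check of this kind might look roughly like the sketch below.
It is written as a function rather than a macro for readability, and the
names sketch_align_range(), SKETCH_PAGE_SIZE and SKETCH_PAGE_MASK as well as
the 4096-byte page size are assumptions, not taken from the kernel source.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for the sketch; the kernel uses its own constants. */
#define SKETCH_PAGE_SIZE	((uintptr_t)4096)
#define SKETCH_PAGE_MASK	(SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: page-align the range [*addr, *addr + *size)
 * outward. Returns 0 on success, or -1 if any step would wrap around.
 */
int
sketch_align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t pageoff = *addr & SKETCH_PAGE_MASK;
	uintptr_t start = *addr - pageoff;
	size_t len = *size;

	/* Adding the page offset must not overflow the length. */
	if (len > SIZE_MAX - pageoff)
		return -1;
	len += pageoff;

	/* Rounding the length up to a page multiple must not overflow. */
	if (len > SIZE_MAX - SKETCH_PAGE_MASK)
		return -1;
	len = (len + SKETCH_PAGE_MASK) & ~(size_t)SKETCH_PAGE_MASK;

	/* The end of the range must not wrap past the top of memory. */
	if (start > UINTPTR_MAX - len)
		return -1;

	*addr = start;
	*size = len;
	return 0;
}
```

Having one helper means mmap, munmap, msync and friends all reject the same
inputs instead of each open-coding slightly different checks.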


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
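
The check described above can be sketched as follows. This is an
illustration only; sketch_check_size() and the 4096-byte page size are
assumptions and do not reproduce the actual uvm_mmap() code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page mask for the sketch (4096-byte pages). */
#define SKETCH_PGMASK	((size_t)4095)

/*
 * Hypothetical helper: round a size up to a page multiple and report
 * ENOMEM if the rounding wrapped to 0. Sizes with the high bit set are
 * accepted, since size_t is unsigned.
 */
int
sketch_check_size(size_t size)
{
	/* Unsigned arithmetic wraps modulo 2^N, so this is well defined. */
	size_t rounded = (size + SKETCH_PGMASK) & ~SKETCH_PGMASK;

	if (size != 0 && rounded == 0)
		return ENOMEM;
	return 0;
}
```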


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls user suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.172 01-Aug-2022 deraadt

some ports bootstraps, and go internals, need a bit more time to adapt
to the padded syscalls going away.


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates them sparsely.
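
The reserve-then-commit pattern this change accommodates looks roughly like
the sketch below. It is an illustration under assumptions, not code from the
tree; reserve_then_commit() is a made-up name.

```c
/* Needed on glibc for MAP_ANON in strict modes; harmless elsewhere. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: reserve reserve_len bytes of address space with
 * PROT_NONE, then make the first commit_len bytes usable. With this
 * commit, the mprotect() call is where RLIMIT_DATA accounting happens,
 * so it is the call that has to be checked for failure.
 * Returns 0 on success, -1 on failure.
 */
int
reserve_then_commit(size_t reserve_len, size_t commit_len)
{
	char *base;

	base = mmap(NULL, reserve_len, PROT_NONE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	if (mprotect(base, commit_len, PROT_READ | PROT_WRITE) == -1) {
		munmap(base, reserve_len);
		return -1;
	}

	base[0] = 1;	/* the committed part is now writable */

	munmap(base, reserve_len);
	return 0;
}
```

A caller would typically reserve far more than it commits, for example
reserve_then_commit(1 << 24, 4096).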

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can then be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
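
The underflow described in this commit can be shown in isolation. The sketch below is not the kernel's code: `limit_allows_naive`, `limit_allows`, and their parameters are hypothetical names, and `size_t` stands in for `vsize_t`.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helpers (not the kernel's code): decide whether
 * `request` more bytes fit under `limit` when `used` bytes are
 * already accounted for.
 */

/* Naive form: `limit - used` wraps to a huge value when used > limit. */
static bool
limit_allows_naive(size_t limit, size_t used, size_t request)
{
	return request <= limit - used;
}

/* Safe form: rule out used > limit before subtracting. */
static bool
limit_allows(size_t limit, size_t used, size_t request)
{
	if (used > limit)
		return false;
	return request <= limit - used;
}
```

With `limit = 100` and `used = 200`, the naive form accepts any small request because the unsigned subtraction wraps, which is the "ineffective when you already had more memory than your limit" case; the safe form rejects it.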


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend that most users use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make calling pledge_fail the responsibility of the _check function. remove it
from the caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program cannot use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
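
The rule in this commit can be observed from userland with a short probe. This is a hedged sketch, not part of the commit: `zero_length_mmap_is_einval` is a hypothetical helper name, and the `_DEFAULT_SOURCE` define is only there so that glibc exposes `MAP_ANON` in strict modes.

```c
#define _DEFAULT_SOURCE		/* assumption: lets glibc expose MAP_ANON */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical probe: returns 1 if a zero-length anonymous mmap fails
 * with EINVAL, 0 otherwise. A mapping that unexpectedly succeeds is
 * left alone, since there is no length to hand to munmap().
 */
static int
zero_length_mmap_is_einval(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	return p == MAP_FAILED && errno == EINVAL;
}
```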


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
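
The kind of integer wrap check this commit refers to can be sketched as below. The kernel's actual macro is not shown in this log, so `range_wraps` and its parameters are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical range check (names are illustrative, not the kernel's):
 * true if [addr, addr + size) wraps past the top of the address space.
 * Unsigned addition wraps modulo 2^N, so a wrapped sum is below addr.
 */
static bool
range_wraps(uintptr_t addr, uintptr_t size)
{
	return addr + size < addr;
}
```

Keeping one such check in a single place means mmap, munmap, msync and friends all reject the same inputs, which appears to be the point of the factoring described above.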


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
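
The "size wrap to 0" case can be illustrated with a page-rounding sketch. This is an assumption-laden example rather than the kernel code: `SKETCH_PAGE_SIZE`, `sketch_round_page`, and `size_wraps_to_zero` are hypothetical names, and a 4096-byte page is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)	/* assumed page size */

/* Round up to a page multiple; the addition wraps to 0 near SIZE_MAX. */
static size_t
sketch_round_page(size_t size)
{
	return (size + (SKETCH_PAGE_SIZE - 1)) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Returns 1 when a nonzero size rounds to 0, i.e. the rounding wrapped. */
static int
size_wraps_to_zero(size_t size)
{
	return size != 0 && sketch_round_page(size) == 0;
}
```

A size with only the high bit set rounds to itself and is not flagged, which matches the commit's point that such sizes are unsigned and need not be rejected outright.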


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in it. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.171 20-Jul-2022 deraadt

the _pad_ system calls from 2021/12/23 can go away
ok guenther


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parenthesis around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates them sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
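
The policy in this commit message can be pictured as a small user-space predicate. This is only an illustrative sketch: `wx_check` and its `wxneeded`/`wxallowed` parameters are hypothetical names chosen here, not the kernel's actual uvm_wxcheck() interface, and the kern.wxabort behaviour is left out.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Hypothetical sketch: return 0 if the requested protection is acceptable,
 * or ENOTSUP if it asks for a mapping that is both writable and executable
 * and the binary is not marked wxneeded on a wxallowed filesystem.
 */
int
wx_check(int prot, int wxneeded, int wxallowed)
{
	int wx = PROT_WRITE | PROT_EXEC;

	if ((prot & wx) != wx)
		return 0;		/* not a W^X violation */
	if (wxneeded && wxallowed)
		return 0;		/* explicitly permitted */
	return ENOTSUP;
}
```

A caller in this sketch would pass the protection bits it intends to hand to mmap(2) or mprotect(2) and treat a non-zero result as the errno to report.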


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
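
The check described in this entry can be illustrated with a minimal userland sketch. It is not the kernel code: `round_size_checked` and `SKETCH_PAGE_SIZE` are hypothetical names, and a 4096-byte page is assumed. The point is that only a non-zero size that wraps to 0 during page rounding is rejected, while a size with the high bit set is accepted because the type is unsigned.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */
#define SKETCH_PAGE_MASK ((size_t)SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: round *sizep up to a page multiple.
 * Returns 0 on success, or ENOMEM if a non-zero size wrapped
 * around to 0 while being rounded up.
 */
static int
round_size_checked(size_t *sizep)
{
	size_t size = *sizep;
	size_t rounded = (size + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;

	if (size != 0 && rounded == 0)
		return ENOMEM;	/* rounding wrapped past the top of size_t */

	/* Sizes with the high bit set are fine: size_t is unsigned. */
	*sizep = rounded;
	return 0;
}
```

Any unsigned addition that wraps here leaves a value below the page size, which the mask turns into 0, so comparing the rounded result against 0 is enough to detect the wrap.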


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred-only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).
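
A minimal userland sketch of how a program might pass such advice is shown here. The helper name `scratch_and_release` is hypothetical, and the exact effect of each advice value is up to the kernel in use; the code only shows the call sequence.

```c
#define _DEFAULT_SOURCE	/* exposes MAP_ANON and madvise() on glibc */
#include <sys/mman.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous memory, dirty it,
 * then tell the kernel via `advice` that the contents are no longer
 * needed.  Returns 0 on success, -1 on error.
 */
static int
scratch_and_release(size_t len, int advice)
{
	int rv = 0;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	memset(p, 0xa5, len);		/* touch every page */

	if (madvise(p, len, advice) == -1)
		rv = -1;		/* advice value rejected */
	if (munmap(p, len) == -1)
		rv = -1;
	return rv;
}
```

A caller would pass `MADV_DONTNEED`, or `MADV_FREE` on systems whose headers define it.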


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in it. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.170 27-Jun-2022 cheloha

kbind(2): unlock syscall, push kernel lock down to binding loop

- Rearrange the security check code in sys_kbind() so that we only
need to take the kernel lock once if we need to raise SIGILL.

- Protect process.ps_kbind_addr and process.ps_kbind_cookie with
process.ps_mtx. This is easier to do after the aforementioned
rearrangement. Under normal circumstances this isn't necessary:
the process is single-threaded when we initialize kbind(2).
But in stranger situations this brief mutex ensures that the
first thread to reach sys_kbind() initializes both variables.

- Wrap the binding loop with the kernel lock. We need to carefully
confirm that uvm_unmap_remove(), uvm_map_extract(), and
uvm_unmap_detach() are MP-safe in a subsequent patch before
completely removing the kernel lock from sys_kbind().

- Remove the kernel lock from kbind(2) in syscalls.master.

Prompted by mpi@, dlg@, and deraadt@. Current patch workshopped with
deraadt@. Based on a patch from dlg@.

With input from dlg@, bluhm@, mpi@, kettenis@, deraadt@, and
guenther@.

Thread: https://marc.info/?l=openbsd-tech&m=165274831829349&w=2

ok deraadt@ kettenis@ mpi@


Revision tags: OPENBSD_7_1_BASE
# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on a even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
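
The reserve-then-populate pattern this entry refers to can be sketched in userland as follows. `reserve_then_commit` is a hypothetical helper, not code from this commit. Per the description above, it is the mprotect(2) call, not the initial mmap(2), that may now fail when RLIMIT_DATA would be exceeded.

```c
#define _DEFAULT_SOURCE	/* exposes MAP_ANON on glibc */
#include <sys/mman.h>
#include <stddef.h>

/*
 * Hypothetical helper: reserve `reserve` bytes of address space with
 * no access, then make the first `commit` bytes readable and writable.
 * Returns the base address, or NULL on failure.
 */
static void *
reserve_then_commit(size_t reserve, size_t commit)
{
	void *base = mmap(NULL, reserve, PROT_NONE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);

	if (base == MAP_FAILED)
		return NULL;

	/* Flipping away from PROT_NONE is the step that can hit the limit. */
	if (mprotect(base, commit, PROT_READ | PROT_WRITE) == -1) {
		munmap(base, reserve);
		return NULL;
	}
	return base;
}
```

Callers of this pattern need to check the mprotect(2) result rather than assume that a successful reservation guarantees usable memory.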


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@
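
A minimal sketch of how a caller might request such a mapping is shown here. `alloc_secret` is a hypothetical helper, and the fallback definition of MAP_CONCEAL as 0 is an assumption made only so the sketch builds on systems without the flag, where the result is an ordinary anonymous mapping.

```c
#define _DEFAULT_SOURCE	/* exposes MAP_ANON on glibc */
#include <sys/mman.h>
#include <stddef.h>

#ifndef MAP_CONCEAL
#define MAP_CONCEAL 0	/* flag unavailable here: plain mapping instead */
#endif

/*
 * Hypothetical helper: allocate `len` bytes for sensitive data that
 * should be left out of core dumps where MAP_CONCEAL is supported.
 * Returns the mapping, or NULL on failure.
 */
static void *
alloc_secret(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_CONCEAL, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```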


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
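
The underflow described in this entry can be shown with a small sketch. Both functions are hypothetical illustrations, not the kernel's code, with `size_t` standing in for `vsize_t`.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive form: would growing by `grow` bytes exceed `limit`, given
 * `used` bytes already in use?  If used > limit, the subtraction
 * wraps to a huge unsigned value and the check never trips.
 */
static bool
exceeds_limit_naive(size_t used, size_t grow, size_t limit)
{
	return grow > limit - used;
}

/*
 * Guarded form: handle the already-over-limit case before
 * subtracting, so the unsigned arithmetic cannot wrap.
 */
static bool
exceeds_limit(size_t used, size_t grow, size_t limit)
{
	if (used > limit)
		return true;
	return grow > limit - used;
}
```

With `used = 200` and `limit = 100`, the naive form reports that one more byte is within the limit, which is the ineffective check the entry describes. The guarded form rejects it.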


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano
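
For contrast, a W^X-respecting way to place generated code in memory can be sketched as below. `install_code` is a hypothetical helper. It maps the memory read/write, fills it, then switches it to read/execute, so the mapping is never writable and executable at the same time.

```c
#define _DEFAULT_SOURCE	/* exposes MAP_ANON on glibc */
#include <sys/mman.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy `len` bytes of generated code into fresh
 * memory without ever requesting PROT_WRITE and PROT_EXEC together.
 * Returns the mapping, or NULL on failure.
 */
static void *
install_code(const void *code, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	memcpy(p, code, len);

	/* Swap RW for RX; the write permission is dropped first. */
	if (mprotect(p, len, PROT_READ | PROT_EXEC) == -1) {
		munmap(p, len);
		return NULL;
	}
	return p;
}
```

A program that instead asks for `PROT_WRITE | PROT_EXEC` in one call is the kind of caller that, per the entry above, now gets ENOTSUP unless it runs from a wxallowed filesystem.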


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make calling pledge_fail the responsibility of the _check function; remove it
from the callers.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
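
The kind of check being factored out can be sketched as follows. This is an illustration under stated assumptions, not the macro from the commit: the name `sketch_align_range()`, the fixed 4096-byte page size, and the -1 error convention are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	((uintptr_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK	(SKETCH_PAGE_SIZE - 1)

/*
 * Page-align an (addr, size) pair the way mmap-family syscalls need,
 * refusing any request whose arithmetic would wrap around.
 * Returns 0 and fills *out_addr / *out_size on success, -1 on wrap.
 */
static int
sketch_align_range(uintptr_t addr, size_t size,
    uintptr_t *out_addr, size_t *out_size)
{
	uintptr_t pageoff = addr & SKETCH_PAGE_MASK;

	addr -= pageoff;			/* truncate to page start */
	if (size > SIZE_MAX - pageoff)
		return -1;			/* size + pageoff wraps */
	size += pageoff;
	if (size > SIZE_MAX - SKETCH_PAGE_MASK)
		return -1;			/* rounding up wraps */
	size = (size + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;
	if (addr + size < addr)
		return -1;			/* end of range wraps */

	*out_addr = addr;
	*out_size = size;
	return 0;
}
```

Keeping the checks in one helper means every caller rejects the same set of wrapping inputs, which is the point the commit message makes about sharing them between mmap, munmap, msync and friends.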


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.169 19-Jan-2022 kn

Grab the kernel lock in uvm_wxcheck() when aborting the process

kern.wxabort=1 logs and kills programs after W^X violations.
At least sigexit() -> coredump() as well as the non-atomic increment of
ps_wxcounter require protection, so grab the big lock for the entire block.

This is part of the effort to unlock mmap(2)'s MAP_ANON case.

Feedback mvs claudio kettenis deraadt
OK kettenis


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parenthesis around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many
static-syscall-in-base-binary programs which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which means it is even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
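
The underflow described in rev. 1.131 can be sketched in isolation. This is a
minimal illustration and not the kernel's code: the function names are
hypothetical, and size_t stands in for vsize_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helpers, not taken from uvm_mmap.c. Each returns true
 * when adding `request` bytes to `used` would go past `limit`.
 */
static bool
exceeds_limit_naive(size_t limit, size_t used, size_t request)
{
	/* If used > limit, limit - used wraps to a huge unsigned value. */
	return request > limit - used;
}

static bool
exceeds_limit_checked(size_t limit, size_t used, size_t request)
{
	/* Compare used against limit first, so the subtraction cannot wrap. */
	return used > limit || request > limit - used;
}
```

With limit 100 and used 150, the naive form accepts a request of 10 because
the subtraction wraps, while the checked form rejects it.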


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- makes pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held througout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
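
The size-wrap check from rev. 1.55 can be sketched as follows. This is an
illustration under stated assumptions, not the code in uvm_mmap(): the helper
name and the 4096-byte page size are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)	/* assumed page size */

/*
 * Hypothetical helper: round *sizep up to a page multiple.
 * Returns 0 on success, or ENOMEM if the rounding wrapped around.
 */
static int
round_size_checked(size_t *sizep)
{
	size_t size = *sizep;
	size_t rounded = (size + SKETCH_PAGE_SIZE - 1) &
	    ~(SKETCH_PAGE_SIZE - 1);

	/* A rounded value smaller than the request means it wrapped. */
	if (rounded < size)
		return ENOMEM;
	*sizep = rounded;
	return 0;
}
```

A page-aligned size with only the high bit set passes unchanged, since the
value is unsigned and nothing wrapped; only a request that rounds past the
top of the range is refused.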


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls user suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.168 05-Jan-2022 guenther

Remove kbind(2)'s restriction that a target buffer not cross page
boundaries: hppa has 8-byte PLT entries that sometimes do that.

ok kettenis@


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@
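
The slot arithmetic behind the story above can be sketched as follows. This
is only an illustrative model, not code from the tree: it assumes arguments
occupy consecutive 32-bit slots, and the names packed_slot(), aligned_slot()
and layouts_agree() are hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical model: each syscall argument occupies one 32-bit slot,
 * and a 64-bit off_t occupies two.
 */

/* Slot where a 64-bit argument starts if arguments are packed back to back. */
size_t
packed_slot(size_t slots_before)
{
	return slots_before;
}

/* Slot where it starts if 64-bit arguments must begin on an even slot. */
size_t
aligned_slot(size_t slots_before)
{
	return (slots_before + 1) & ~(size_t)1;
}

/* Nonzero when both conventions put the 64-bit argument in the same place. */
int
layouts_agree(size_t slots_before)
{
	return packed_slot(slots_before) == aligned_slot(slots_before);
}
```

In this model mmap's addr, len, prot, flags and fd take five slots, so the
off_t would start on slot 5 when packed but on slot 6 when aligned. With the
explicit pad argument six slots precede it and both conventions agree, which
is why the pad made custom wrappers unnecessary.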


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
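
The usage pattern this change is meant to help can be sketched from userland
as below. This is a minimal illustration, not taken from Chromium or from the
kernel; reserve_then_commit() is a hypothetical name and the sizes are up to
the caller.

```c
/* glibc feature-test macro so MAP_ANON is visible; assumed harmless elsewhere */
#define _DEFAULT_SOURCE 1

#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve reserve_len bytes of address space as PROT_NONE, then make only
 * the first commit_len bytes accessible.  Returns 0 on success, -1 on error.
 */
int
reserve_then_commit(size_t reserve_len, size_t commit_len)
{
	void *base;

	/* Reservation only: no access permitted yet. */
	base = mmap(NULL, reserve_len, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	/* Flip part of the range to read/write; this call may now fail. */
	if (mprotect(base, commit_len, PROT_READ | PROT_WRITE) == -1) {
		munmap(base, reserve_len);
		return -1;
	}

	/* Touch the committed part to show it is usable. */
	((volatile char *)base)[0] = 1;

	munmap(base, reserve_len);
	return 0;
}
```

Per the log message, the limit is counted at the mprotect(2) step rather
than at the PROT_NONE mmap(2), so callers should check the mprotect(2)
return value.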


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can then be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
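
From userland, requesting such a region might look like the sketch below.
It is an illustration only; alloc_stack_region() is a hypothetical helper,
and it assumes MAP_STACK is provided by <sys/mman.h> on the target system.

```c
/* glibc feature-test macro so MAP_ANON/MAP_STACK are visible; assumed harmless elsewhere */
#define _DEFAULT_SOURCE 1

#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate len bytes of zero-filled memory flagged as stack.
 * Returns NULL on failure.  The flag is given at mmap(2) time because,
 * as the log message notes, there is no mprotect(2) equivalent.
 */
void *
alloc_stack_region(size_t len)
{
	void *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON | MAP_STACK, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

A region obtained this way could then be handed to sigaltstack(2) or
pthread_attr_setstack(3) and released with munmap(2).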


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
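
The policy described above can be modelled as below. This is a simplified
sketch, not the kernel's uvm_wxcheck(); wx_check() and its wx_permitted
parameter (standing for "binary marked wxneeded on a wxallowed filesystem")
are hypothetical, and kern.wxabort handling is left out.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Return 0 if a mapping with protection prot is acceptable, or ENOTSUP
 * if it is both writable and executable and the process is not exempt.
 */
int
wx_check(int prot, int wx_permitted)
{
	int wx = (prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC);

	if (wx && !wx_permitted)
		return ENOTSUP;
	return 0;
}
```

A request for PROT_READ | PROT_EXEC passes, while adding PROT_WRITE to it
yields ENOTSUP unless the exemption applies.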


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
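
The underflow can be illustrated with the sketch below. It is not the code
from the diff: size_t stands in for vsize_t, and naive_over_limit() and
safe_over_limit() are hypothetical names.

```c
#include <stddef.h>

/*
 * Naive check: when used > limit, the unsigned subtraction wraps to a
 * huge value and the request is never rejected.
 */
int
naive_over_limit(size_t limit, size_t used, size_t request)
{
	return request > limit - used;
}

/* Safe check: reject first if usage already exceeds the limit. */
int
safe_over_limit(size_t limit, size_t used, size_t request)
{
	return used > limit || request > limit - used;
}
```

With a limit of 100, usage of 150 and a request of 10, the naive version
lets the request through while the safe version rejects it.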


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
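
The behaviour can be observed from userland with a sketch like the one
below. zero_length_mmap_errno() is a hypothetical helper written for this
illustration.

```c
/* glibc feature-test macro so MAP_ANON is visible; assumed harmless elsewhere */
#define _DEFAULT_SOURCE 1

#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Attempt a 0-byte anonymous mapping and return the resulting errno,
 * or 0 if the call unexpectedly succeeds.
 */
int
zero_length_mmap_errno(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p != MAP_FAILED)
		return 0;
	return errno;
}
```

On a system that follows the POSIX rule cited above, the helper is expected
to return EINVAL.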


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
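
A minimal sketch of the kind of wrap check this revision describes, not the
macro that was committed. The name align_range() and the 4096-byte page size
are assumptions made for illustration. It page-aligns an (addr, size) pair
and refuses any request where an intermediate sum would overflow:

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uintptr_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: page-align the range [*addr, *addr + *size) in place.
 * Returns false, leaving the arguments untouched, if any step would wrap.
 */
static bool
align_range(uintptr_t *addr, uintptr_t *size)
{
	uintptr_t pageoff = *addr & SKETCH_PAGE_MASK;
	uintptr_t start = *addr - pageoff;	/* truncate to a page boundary */
	uintptr_t len = *size;

	if (pageoff != 0) {
		if (len > UINTPTR_MAX - pageoff)
			return false;	/* size + pageoff would wrap */
		len += pageoff;
	}
	if (len != 0) {
		if (len > UINTPTR_MAX - SKETCH_PAGE_MASK)
			return false;	/* rounding up to a page would wrap */
		len = (len + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;
		if (start > UINTPTR_MAX - len)
			return false;	/* start + len would wrap */
	}
	*addr = start;
	*size = len;
	return true;
}
```

Keeping the checks in one place means every caller rejects the same set of
bad ranges, which is the stated goal of the change.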


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred-only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.167 23-Dec-2021 guenther

Roll the syscalls that have an off_t argument to remove the explicit padding.
Switch libc and ld.so to the generic stubs for these calls.
WARNING: reboot to updated kernel before installing libc or ld.so!

Time for a story...

When gcc (back in 1.x days) first implemented long long, it didn't (always)
pass 64bit arguments in 'aligned' registers/stack slots, with the result that
argument offsets didn't match structure offsets. This affected the nine system
calls that pass off_t arguments:
ftruncate lseek mmap mquery pread preadv pwrite pwritev truncate

To avoid having to do custom ASM wrappers for those, BSD put an explicit pad
argument in so that the off_t argument would always start on an even slot and
thus be naturally aligned. Thus those odd wrappers in lib/libc/sys/ that use
__syscall() and pass an extra '0' argument.

The ABIs for different CPUs eventually settled how things should be passed on
each and gcc 2.x followed them. The only arch now where it helps is landisk,
which needs to skip the last argument register if it would be the first half of
a 64bit argument. So: add new syscalls without the pad argument and on landisk
do that skipping directly in the syscall handler in the kernel. Keep compat
support for the existing syscalls long enough for the transition.

ok deraadt@


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parenthesis around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can then be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
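
A minimal sketch of the underflow this revision guards against. The helper
name exceeds_data_limit() and the use of size_t in place of vsize_t are
assumptions for illustration, not the committed code. The naive test
`size > limit - used` wraps to a huge value when `used` already exceeds
`limit`, so the check never fires; testing that case first avoids the
subtraction entirely:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true when mapping `size` more bytes would
 * push a process past its data limit.  All quantities are unsigned.
 */
static bool
exceeds_data_limit(size_t limit, size_t used, size_t size)
{
	/* Already over the limit: `limit - used` would underflow. */
	if (used > limit)
		return true;
	/* Safe now: limit - used cannot wrap. */
	return size > limit - used;
}
```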


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make calling pledge_fail the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
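
The kind of check being factored out can be sketched as follows. This is an assumption about its general shape rather than the macro from the commit; range_wraps() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the shared check: returns nonzero if
 * addr + size would wrap past the top of the address space.
 * The comparison is written so that it cannot itself overflow.
 */
static int
range_wraps(uintptr_t addr, size_t size)
{
	return size > UINTPTR_MAX - addr;
}
```

Writing the test as a subtraction from the maximum avoids computing addr + size, which is the operation that would wrap.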


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in lesse kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
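
The wrap in question happens when a size is rounded up to a page multiple. A minimal sketch, assuming a 4096-byte page; SKETCH_PAGE_SIZE and round_size() are hypothetical names, not identifiers from uvm_mmap.c.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)	/* assumed page size */

/*
 * Round size up to a page multiple. If the rounding wraps to 0 the
 * request is refused with ENOMEM. Sizes with the high bit set are
 * accepted, since size_t is unsigned.
 */
static int
round_size(size_t size, size_t *out)
{
	size_t rounded = (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);

	if (size != 0 && rounded == 0)
		return ENOMEM;
	*out = rounded;
	return 0;
}
```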


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred-only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@
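
A check of this kind might look like the sketch below. The mask name SKETCH_PROT_MASK and the function check_prot() are hypothetical, and returning EINVAL is an assumption; the log message only says such requests are rejected.

```c
#include <errno.h>
#include <sys/mman.h>

/* Local stand-in for the set of defined protection bits. */
#define SKETCH_PROT_MASK (PROT_READ | PROT_WRITE | PROT_EXEC)

/* Reject any protection value carrying bits outside the defined set. */
static int
check_prot(int prot)
{
	if ((prot & ~SKETCH_PROT_MASK) != 0)
		return EINVAL;	/* assumed error code */
	return 0;
}
```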


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.166 10-Dec-2021 guenther

Revert "kbind(2): disable system call if not initialized before
first __tfork(2)"

The immediate issue is that a process linked with -znow will still
perform lazy relocation on objects loaded with dlopen(), but there
are possibly other dark corners to plumb to find a better invariant.

Problem reported by thfr@


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parenthesis around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@
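
A minimal userland sketch of how a caller might request such a mapping. The helper names secret_alloc() and secret_free() are hypothetical, and the fallback definition of MAP_CONCEAL to 0 is only there so the sketch builds on systems that lack the flag; on those systems the memory is not concealed.

```c
#define _DEFAULT_SOURCE 1	/* expose MAP_ANON/MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/* MAP_CONCEAL is OpenBSD-specific; assume a no-op value elsewhere. */
#ifndef MAP_CONCEAL
#define MAP_CONCEAL 0
#endif
#ifndef MAP_ANON
#define MAP_ANON MAP_ANONYMOUS
#endif

/* Hypothetical helper: map len bytes that should stay out of core dumps. */
void *
secret_alloc(size_t len)
{
	void *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_CONCEAL, -1, 0);
	return (p == MAP_FAILED) ? NULL : p;
}

/* Hypothetical helper: release a mapping made by secret_alloc(). */
int
secret_free(void *p, size_t len)
{
	return munmap(p, len);
}
```

The flag is simply OR'ed into the usual anonymous-mapping flags, so existing callers only need to change their mmap(2) call.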


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis
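
The behaviour described above can be sketched as follows. This is an illustration only, not the kernel code: mincore_lie() and SKETCH_PAGE_SIZE are made-up names, and the page size of 4096 is an assumption.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the sketch */

/*
 * Report every page covered by len as resident, without consulting
 * any real residency state. Returns the number of entries written.
 */
size_t
mincore_lie(size_t len, unsigned char *vec)
{
	size_t npages;

	npages = len / SKETCH_PAGE_SIZE + (len % SKETCH_PAGE_SIZE != 0);
	memset(vec, 1, npages);
	return npages;
}
```

Because the answer never depends on the actual state of the pages, a caller learns nothing about what another process has touched.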


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
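
The stack-pointer check can be modelled roughly as below. This is a simplified sketch, not the uvm implementation: struct sketch_region and sp_is_valid() are hypothetical, and the real kernel looks the address up in the process map rather than a flat array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a map entry. */
struct sketch_region {
	uintptr_t start;	/* inclusive */
	uintptr_t end;		/* exclusive */
	bool is_stack;		/* region was created with MAP_STACK */
};

/*
 * Return true only if sp lies inside a region marked as stack.
 * A false result corresponds to the case where SIGSEGV is delivered.
 */
bool
sp_is_valid(const struct sketch_region *map, size_t n, uintptr_t sp)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (sp >= map[i].start && sp < map[i].end)
			return map[i].is_stack;
	}
	return false;
}
```

A stack pivot that points the stack register at heap or data memory fails this test at the next fault or system call.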


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@
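
A sketch of the flag handling described above, under stated assumptions: the SK_* constants are illustrative values, not the real <sys/mman.h> definitions, and normalize_anon_flags() is a hypothetical helper rather than a kernel function.

```c
#include <errno.h>

/* Illustrative flag values only; not the real mmap(2) constants. */
#define SK_MAP_SHARED	0x0001
#define SK_MAP_PRIVATE	0x0002
#define SK_MAP_ANON	0x1000
#define SK_MAP_NOFAULT	0x2000

/*
 * For anonymous mappings: reject the no-fault flag, and treat a
 * request with neither sharing flag as private. Other mappings are
 * left untouched. Returns 0 or an errno value.
 */
int
normalize_anon_flags(int *flags)
{
	if ((*flags & SK_MAP_ANON) == 0)
		return 0;
	if (*flags & SK_MAP_NOFAULT)
		return EINVAL;
	if ((*flags & (SK_MAP_SHARED | SK_MAP_PRIVATE)) == 0)
		*flags |= SK_MAP_PRIVATE;
	return 0;
}
```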


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
Under the new semantics, W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
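
The policy above can be sketched as a small decision function. This is an assumption-laden model, not uvm_wxcheck() itself: wx_check() and its two boolean parameters are hypothetical, and the kern.wxabort path is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Return 0 if the requested protection is acceptable, ENOTSUP if it
 * asks for writable and executable memory at once and the binary is
 * not a "wxneeded" binary on a "wxallowed" filesystem.
 */
int
wx_check(int prot, bool wxneeded, bool wxallowed)
{
	const int wx = PROT_WRITE | PROT_EXEC;

	if ((prot & wx) != wx)
		return 0;		/* not a W^X violation */
	if (wxneeded && wxallowed)
		return 0;		/* explicitly permitted */
	return ENOTSUP;
}
```

Both conditions must hold for the exemption, so a marked binary on an ordinary filesystem is still refused.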


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
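
The underflow can be illustrated with a small sketch. The function data_limit_ok() is hypothetical and uses size_t in place of vsize_t; it shows one way to order the comparison so that the subtraction cannot wrap.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if size more bytes fit under limit when used bytes are
 * already accounted for. Writing the test as "size > limit - used"
 * alone is wrong: when used exceeds limit, the unsigned subtraction
 * wraps to a huge value and the check passes.
 */
bool
data_limit_ok(size_t limit, size_t used, size_t size)
{
	if (used > limit)
		return false;	/* already over the limit */
	return size <= limit - used;
}
```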


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend that most users use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
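
One way such a shared check could look, written as a function for clarity. This is a sketch under assumptions, not the macro from the source file: align_range() and SKETCH_PAGE_MASK are made-up names and a 4096-byte page is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_MASK ((uintptr_t)4095)	/* assumed 4096-byte pages */

/*
 * Page-align the range [*addr, *addr + *size) in place.
 * Returns false, leaving the arguments untouched, if any step of
 * the arithmetic would wrap around the address space.
 */
bool
align_range(uintptr_t *addr, uintptr_t *size)
{
	uintptr_t pageoff = *addr & SKETCH_PAGE_MASK;
	uintptr_t start = *addr - pageoff;
	uintptr_t len = *size + pageoff;
	uintptr_t rounded;

	if (len < *size)
		return false;		/* size + pageoff wrapped */
	rounded = (len + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;
	if (rounded < len)
		return false;		/* rounding up wrapped */
	if (start + rounded < start)
		return false;		/* end address wrapped */
	*addr = start;
	*size = rounded;
	return true;
}
```

Keeping the three wrap tests in one place means each system call applies the same rules instead of a slightly different hand-written variant.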


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in it. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.165 05-Dec-2021 cheloha

kbind(2): disable system call if not initialized before first __tfork(2)

To unlock kbind(2) we need to protect ps_kbind_addr and
ps_kbind_cookie.

The simplest way to do this is to disallow kbind(2) initialization
after the first __tfork(2) call. If the first thread does not
initialize the kbind(2) variables before __tfork(2) then we disable
kbind(2) during that first __tfork(2) call.

This is guenther@'s patch, I'm just committing it.

Discussed with guenther@, deraadt@, kettenis@, and mpi@.

ok kettenis@, positive response from mpi@, "I am busy" guenther@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
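
The underflow is easy to reproduce outside the kernel. The sketch below is
not the committed code; limit, used and size are hypothetical stand-ins
for the RLIMIT_DATA value, the bytes already charged and the new request,
all unsigned like vsize_t.

```c
#include <assert.h>
#include <stddef.h>

/* Broken: when used > limit, limit - used wraps to a huge value. */
static int
over_limit_naive(size_t limit, size_t used, size_t size)
{
	return size > limit - used;
}

/* Fixed: test used against limit first, then compare without wrapping. */
static int
over_limit_checked(size_t limit, size_t used, size_t size)
{
	return used > limit || size > limit - used;
}

/* Self-check of the case described above: already past the limit. */
static void
check_over_limit(void)
{
	assert(over_limit_naive(100, 150, 10) == 0);	/* check is ineffective */
	assert(over_limit_checked(100, 150, 10) == 1);	/* request is refused */
	assert(over_limit_checked(100, 50, 50) == 0);	/* exactly fits */
	assert(over_limit_checked(100, 50, 51) == 1);	/* one byte too many */
}
```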


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held througout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
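
A sketch of the kind of checks such a shared helper performs, written as a
function rather than a macro. This is an illustration under stated
assumptions, not the macro from the commit: the name align_range(), the
fixed 4096-byte page size and the -1 return value are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	((size_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK	(SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: page-align the range [*addr, *addr + *size).
 * On success stores the aligned values and returns 0; returns -1 and
 * leaves the arguments untouched if any step would wrap around.
 */
static int
align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t a = *addr;
	size_t s = *size;
	size_t pageoff = a & SKETCH_PAGE_MASK;

	if (s > SIZE_MAX - pageoff)
		return -1;		/* size + pageoff would wrap */
	a -= pageoff;
	s += pageoff;

	if (s > SIZE_MAX - SKETCH_PAGE_MASK)
		return -1;		/* rounding up would wrap */
	s = (s + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;

	if (s > UINTPTR_MAX - a)
		return -1;		/* addr + size would wrap */

	*addr = a;
	*size = s;
	return 0;
}

/* Self-check: one ordinary range and one that must be rejected. */
static void
check_align_range(void)
{
	uintptr_t addr = 0x1234;
	size_t size = 10;

	assert(align_range(&addr, &size) == 0);
	assert(addr == 0x1000 && size == 0x1000);

	addr = 0x1001;
	size = SIZE_MAX;
	assert(align_range(&addr, &size) == -1);
	assert(addr == 0x1001 && size == SIZE_MAX);	/* left untouched */
}
```

Keeping the checks in one place means every caller rejects the same
inputs, which is the consistency the message above is after.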


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in lesse kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
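
The wrap described in r1.55 can be illustrated with a small sketch. This is
not the kernel code; sketch_round_size() and the SKETCH_PAGE_* names are
hypothetical, and a 4096-byte page is assumed. Rounding a size near the top
of the unsigned range up to a page multiple wraps to 0, which is rejected
with ENOMEM, while a size that merely has the high bit set is accepted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration only; the real value is per-arch. */
#define SKETCH_PAGE_SIZE ((size_t)4096)
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Round `size' up to a multiple of the page size.
 * Returns 0 and stores the result in *rounded on success, or ENOMEM
 * when the unsigned addition wraps and the rounded size becomes 0.
 */
static int
sketch_round_size(size_t size, size_t *rounded)
{
	size_t r = (size + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;

	/* A non-zero request that rounds to 0 has wrapped around. */
	if (size != 0 && r == 0)
		return ENOMEM;

	*rounded = r;
	return 0;
}
```

Note that the check is on the wrapped result, not on the high bit of the
request, so large but representable sizes still pass.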


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE to in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.164 26-Mar-2021 mpi

Remove parentheses around return value to reduce the diff with NetBSD.

No functional change.

ok mlarkin@


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
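
The underflow described in r1.131 can be sketched as follows. This is an
illustration only, not the committed diff; both function names are
hypothetical and size_t stands in for vsize_t. With unsigned arithmetic,
"limit - used" wraps to a huge value once the process already uses more
than its limit, so the naive comparison never trips.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: is `size' more than the room left under `limit'?
 * When used > limit the subtraction wraps, the "room" looks enormous,
 * and the check wrongly reports that the request fits.
 */
static bool
sketch_over_limit_naive(size_t used, size_t size, size_t limit)
{
	return size > limit - used;
}

/*
 * Guarded check: reject outright when usage already exceeds the limit,
 * and only then compare against the (now safe) remaining room.
 */
static bool
sketch_over_limit(size_t used, size_t size, size_t limit)
{
	if (used > limit)
		return true;
	return size > limit - used;
}
```

Testing "used > limit" first keeps the subtraction from ever being
evaluated on operands that would wrap.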


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- makes pledge_fail call the responsability to the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
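
The behaviour described above can be observed from userland. This is a
minimal sketch, not code from the tree: try_anon_mmap() and
demo_zero_length() are hypothetical helpers introduced here for
illustration, and the 4096-byte length is an arbitrary assumption.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* glibc only: exposes MAP_ANON; ignored elsewhere */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: attempt an anonymous private mapping of len bytes.
 * Returns 0 on success (the mapping is released again) or the errno
 * value reported by mmap(2) on failure.
 */
int
try_anon_mmap(size_t len)
{
	void *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return errno;
	munmap(p, len);
	return 0;
}

/* Usage: a zero-length request is expected to be rejected with EINVAL. */
void
demo_zero_length(void)
{
	assert(try_anon_mmap(0) == EINVAL);
	assert(try_anon_mmap(4096) == 0);
}
```

Returning the errno value instead of the pointer keeps the caller from
ever holding a mapping it cannot tell apart from free memory.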


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
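
As an illustration of the kind of check such a macro has to perform, here
is a minimal sketch written as a function. It is not the macro from the
tree: aligned_len(), EX_PAGE_SIZE and EX_PAGE_MASK are hypothetical names,
and the 4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SIZE	((size_t)4096)		/* assumed page size */
#define EX_PAGE_MASK	(EX_PAGE_SIZE - 1)

/*
 * Hypothetical helper: page-align the range [addr, addr + size) and
 * return its rounded length, or 0 if the range is empty or if any step
 * of the arithmetic would wrap around.
 */
size_t
aligned_len(uintptr_t addr, size_t size)
{
	size_t pageoff = (size_t)(addr & EX_PAGE_MASK);
	uintptr_t start = addr - pageoff;
	size_t len;

	/* The mask trick only works for power-of-two page sizes. */
	assert((EX_PAGE_SIZE & EX_PAGE_MASK) == 0);

	if (size == 0)
		return 0;
	/* size + pageoff must not wrap. */
	if (size > SIZE_MAX - pageoff)
		return 0;
	len = size + pageoff;
	/* Rounding up to a page multiple must not wrap. */
	if (len > SIZE_MAX - EX_PAGE_MASK)
		return 0;
	len = (len + EX_PAGE_MASK) & ~EX_PAGE_MASK;
	/* start + len must stay below the top of the address space. */
	if (len > UINTPTR_MAX - start)
		return 0;
	return len;
}
```

Each comparison is arranged so that the subtraction on the right-hand side
cannot itself wrap, which is why one shared helper is easier to audit than
per-syscall copies.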


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range, giving quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in it. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
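
The reserve-then-populate pattern this change is aimed at can be sketched
as follows. This is an illustrative userland example, not code from the
tree: reserve_then_commit() is a hypothetical helper, and the sizes a
caller passes are its own choice.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* glibc only: exposes MAP_ANON; ignored elsewhere */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: reserve `reserve` bytes of address space with
 * PROT_NONE, then make the first `commit` bytes usable with mprotect(2).
 * Returns 0 on success or an errno value on failure.
 */
int
reserve_then_commit(size_t reserve, size_t commit)
{
	char *base;
	int error;

	if (commit == 0 || commit > reserve)
		return EINVAL;

	/* Reservation only: no access is permitted yet. */
	base = mmap(NULL, reserve, PROT_NONE, MAP_ANON | MAP_PRIVATE, -1, 0);
	if (base == MAP_FAILED)
		return errno;

	/*
	 * Flipping the protection is the step that, per the commit message
	 * above, is now counted against RLIMIT_DATA and may therefore fail.
	 */
	if (mprotect(base, commit, PROT_READ | PROT_WRITE) == -1) {
		error = errno;
		munmap(base, reserve);
		return error;
	}

	/* Anonymous pages start out zero-filled and are now writable. */
	assert(base[0] == 0);
	base[0] = 1;

	munmap(base, reserve);
	return 0;
}
```

A caller following this pattern has to treat an mprotect(2) failure as an
out-of-memory condition, since that is where the limit is now enforced.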


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
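
The underflow described above can be reproduced with plain unsigned
arithmetic. This is a minimal sketch, not the kernel code: naive_exceeds(),
safe_exceeds() and demo_underflow() are hypothetical names, and size_t
stands in for vsize_t.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Naive check: when used > limit, `limit - used` wraps to a huge value
 * and the comparison never reports that the limit is exceeded.
 */
int
naive_exceeds(size_t used, size_t want, size_t limit)
{
	return want > limit - used;
}

/*
 * Underflow-safe check: a process already over its limit is rejected
 * before the subtraction is attempted.
 */
int
safe_exceeds(size_t used, size_t want, size_t limit)
{
	if (used > limit)
		return 1;
	return want > limit - used;
}

/* Usage: already 100 bytes in use against a limit of 50. */
void
demo_underflow(void)
{
	assert(naive_exceeds(100, 10, 50) == 0);	/* wrongly allowed */
	assert(safe_exceeds(100, 10, 50) == 1);		/* correctly refused */
}
```

The two functions agree whenever usage is within the limit; they differ
only in the case the commit message calls out, where the process already
has more memory than its limit allows.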


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function; remove it
from the caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handing mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
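
As an illustration of the kind of check being shared, here is a sketch of
page-aligning an (addr, size) pair while rejecting integer wraparound. It
is written as a function rather than a macro, and the name align_range()
and its return convention are assumptions, not the code in uvm_mmap.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: round *addr down and *size up to page boundaries.
 * Returns 0 on success, or -1 if any step would wrap around.
 * `pagesz` must be a power of two.
 */
static int
align_range(uintptr_t *addr, size_t *size, size_t pagesz)
{
	size_t pageoff, sz, rounded;
	uintptr_t base;

	assert(pagesz != 0 && (pagesz & (pagesz - 1)) == 0);

	/* Bytes between the page start and the requested address. */
	pageoff = (size_t)(*addr & (pagesz - 1));
	if (*size + pageoff < *size)
		return -1;		/* size + offset wrapped */
	sz = *size + pageoff;

	/* Round up to a whole number of pages. */
	rounded = (sz + pagesz - 1) & ~(pagesz - 1);
	if (rounded < sz)
		return -1;		/* rounding wrapped */

	base = *addr - pageoff;
	if (base + (uintptr_t)rounded < base)
		return -1;		/* end address wrapped */

	*addr = base;
	*size = rounded;
	return 0;
}
```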


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.163 07-Oct-2020 mpi

Do not release the KERNEL_LOCK() when mmap(2)ing files.

Previous attempt to unlock amap & anon exposed a race in vnode reference
counting. So be conservative with the code paths that we're not fully moving
out of the KERNEL_LOCK() to allow us to concentrate on one area at a time.

The panic reported was:

....panic: vref used where vget required
....db_enter() at db_enter+0x5
....panic() at panic+0x129
....vref(ffffff03b20d29e8) at vref+0x5d
....uvn_attach(1010000,ffffff03a5879dc0) at uvn_attach+0x11d
....uvm_mmapfile(7,ffffff03a5879dc0,2,1,13,100000012) at uvm_mmapfile+0x12c
....sys_mmap(c50,ffff8000225f82a0,1) at sys_mmap+0x604
....syscall() at syscall+0x279

Note that this change has no effect as long as mmap(2) is still executed with
ze big lock.

ok kettenis@


Revision tags: OPENBSD_6_8_BASE
# 1.162 04-Oct-2020 deraadt

Recent changes for PROT_NONE pages to not count against resource limits,
failed to note this also guarded against heavy amap allocations in the
MAP_SHARED case. Bring back the checks for MAP_SHARED
from semarie, ok kettenis
https://syzkaller.appspot.com/bug?extid=d80de26a8db6c009d060


Revision tags: OPENBSD_6_7_BASE
# 1.161 04-Mar-2020 kettenis

branches: 1.161.4;
Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
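
The class of bug can be illustrated with unsigned arithmetic. This is a
sketch of the pattern only, not the kernel code; both function names are
hypothetical and size_t stands in for vsize_t.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: when used > limit, the subtraction wraps around to a
 * huge value and the request is never rejected.
 */
bool
exceeds_limit_naive(size_t limit, size_t used, size_t size)
{
	return size > limit - used;
}

/* Safe check: handle the "already over the limit" case first. */
bool
exceeds_limit_safe(size_t limit, size_t used, size_t size)
{
	if (used > limit)
		return true;
	return size > limit - used;
}
```

With limit 100, used 200 and size 10, the naive version lets the request
through while the safe version rejects it.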


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function; remove it
from the caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
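
The kind of check such a macro performs can be sketched as a plain
function. This is an illustration under assumptions, not the macro from
this revision: range_wraps() is a hypothetical name and a 4096-byte page
size is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch only. */
#define SKETCH_PAGE_MASK	((size_t)4096 - 1)

/*
 * Hypothetical helper: return 1 if page-aligning the range
 * [addr, addr + size) would wrap around, 0 if the range is usable.
 */
int
range_wraps(uintptr_t addr, size_t size)
{
	size_t pageoff = (size_t)(addr & SKETCH_PAGE_MASK);

	/* Extending the size back to the page boundary must not wrap. */
	if (size > SIZE_MAX - pageoff)
		return 1;
	size += pageoff;

	/* Rounding the size up to whole pages must not wrap. */
	if (size > SIZE_MAX - SKETCH_PAGE_MASK)
		return 1;
	size = (size + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;

	/* The end address of the aligned range must not wrap. */
	addr -= pageoff;
	if (size > UINTPTR_MAX - addr)
		return 1;
	return 0;
}
```

Keeping the checks in one place means mmap, munmap, msync and friends all
reject the same inputs.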


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).
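
A minimal userland sketch of the two advice values described above, assuming
an anonymous private mapping. The helper name advise_unneeded() is
hypothetical and only for illustration; _DEFAULT_SOURCE is defined solely so
that glibc exposes MAP_ANON and madvise().

```c
#define _DEFAULT_SOURCE 1	/* glibc: expose MAP_ANON and madvise() */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map npages anonymous pages, dirty them, then tell the kernel they are
 * no longer needed. advice is expected to be MADV_DONTNEED or MADV_FREE.
 * Returns 0 on success, -1 on any failure.
 * Hypothetical helper, not taken from the OpenBSD tree.
 */
static int
advise_unneeded(size_t npages, int advice)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	size_t len;
	char *p;
	int rc;

	if (pgsz <= 0 || npages == 0)
		return -1;
	len = npages * (size_t)pgsz;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xa5, len);		/* fault the pages in */
	rc = madvise(p, len, advice);	/* hint only; mapping stays valid */

	munmap(p, len);
	return rc == 0 ? 0 : -1;
}
```

The mapping remains valid after either call; only the page contents may be
discarded, so a caller must not rely on the old data afterwards.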


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in it. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.161 04-Mar-2020 kettenis

Do not count pages mapped as PROT_NONE against the RLIMIT_DATA limit.
Instead count (and check the limit) when their protection gets flipped
from PROT_NONE to something that permits access. This means that
mprotect(2) may now fail if changing the protection would exceed RLIMIT_DATA.

This helps code (such as Chromium's JavaScript interpreter) that reserves
large chunks of address space but populates it sparsely.

ok deraadt@, otto@, kurt@, millert@, robert@
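
The reserve-then-commit pattern this entry refers to can be sketched from
userland as follows. reserve_region() and commit_range() are hypothetical
names used only for illustration; the point is that the mprotect(2) step is
where the limit is now checked, so its return value has to be tested.

```c
#define _DEFAULT_SOURCE 1	/* glibc: expose MAP_ANON */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve address space without making it accessible. */
static void *
reserve_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_NONE, MAP_ANON | MAP_PRIVATE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Make [base + off, base + off + len) readable and writable.
 * off and len are assumed to be page-aligned. Per the entry above, this
 * is where the pages begin to count against RLIMIT_DATA, so the call
 * may fail and must be checked. Returns 0 on success, -1 on failure.
 */
static int
commit_range(void *base, size_t off, size_t len)
{
	return mprotect((char *)base + off, len, PROT_READ | PROT_WRITE);
}
```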


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many
static-syscall-in-base-binary programs which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@
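
A hedged sketch of how a caller might request such a mapping for key
material. alloc_secret() and free_secret() are hypothetical helpers; the
fallback define is an assumption for building on systems that lack
MAP_CONCEAL, where the mapping degrades to an ordinary anonymous one.

```c
#define _DEFAULT_SOURCE 1	/* glibc: expose MAP_ANON */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_CONCEAL
#define MAP_CONCEAL 0	/* assumed fallback: flag not available here */
#endif

/* Allocate memory that should be left out of core dumps. */
static void *
alloc_secret(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_CONCEAL, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Wipe the region through a volatile pointer, then unmap it. */
static void
free_secret(void *p, size_t len)
{
	volatile unsigned char *q = p;
	size_t i;

	for (i = 0; i < len; i++)
		q[i] = 0;
	munmap(p, len);
}
```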


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
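
For code that allocates its own stacks, the request looks roughly like the
sketch below. alloc_stack_region() is a hypothetical name, and the fallback
define is an assumption for systems whose headers lack MAP_STACK.

```c
#define _DEFAULT_SOURCE 1	/* glibc: expose MAP_ANON and MAP_STACK */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_STACK
#define MAP_STACK 0	/* assumed fallback: flag not available here */
#endif

/*
 * Allocate a region intended for use as a stack. The region comes back
 * zero-filled, since the flag can only be applied by mmap() itself.
 * Returns NULL on failure.
 */
static void *
alloc_stack_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_STACK, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```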


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
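
A program can stay within these rules by never asking for write and execute
at the same time. The sketch below is one way to do that from userland;
map_rw_then_rx() is a hypothetical helper, not code from the tree.

```c
#define _DEFAULT_SOURCE 1	/* glibc: expose MAP_ANON */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Copy n bytes into a fresh mapping while it is writable, then swap the
 * protection to read+exec. The mapping is never writable and executable
 * at once. Returns the mapping, or NULL on failure.
 */
static void *
map_rw_then_rx(const void *src, size_t n, size_t maplen)
{
	void *p;

	if (n > maplen)
		return NULL;
	p = mmap(NULL, maplen, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	memcpy(p, src, n);
	if (mprotect(p, maplen, PROT_READ | PROT_EXEC) == -1) {
		munmap(p, maplen);
		return NULL;
	}
	return p;
}
```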


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
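
The class of bug described here can be shown with plain unsigned
arithmetic. This is an illustrative sketch only, using size_t in place of
vsize_t and hypothetical function names; it is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Buggy shape: if used > limit, the unsigned subtraction wraps to a
 * huge value and every request appears to fit.
 */
static bool
fits_buggy(size_t limit, size_t used, size_t req)
{
	return req <= limit - used;
}

/* Safer shape: reject the over-limit case before subtracting. */
static bool
fits_checked(size_t limit, size_t used, size_t req)
{
	if (used > limit)
		return false;
	return req <= limit - used;
}
```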


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
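
The behaviour can be observed from userland with a probe like the one
below. zero_length_mmap_errno() is a hypothetical helper for illustration.

```c
#define _DEFAULT_SOURCE 1	/* glibc: expose MAP_ANON */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Attempt a zero-length anonymous mapping.
 * Returns the errno reported by mmap(), or 0 if the call succeeded.
 */
static int
zero_length_mmap_errno(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p != MAP_FAILED)
		return 0;	/* not expected on a conforming system */
	return errno;
}
```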


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
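
The kind of wrap checking this entry describes can be sketched as a small
function. This is an assumption-laden illustration, not the macro from the
tree: align_range(), the SKETCH_* names, and the fixed 4096-byte page size
are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)		/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Page-align the range [*addr, *addr + *size): round the address down
 * and the size up to a page multiple.
 * Returns 0 on success, -1 if any step would wrap around.
 */
static int
align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t a = *addr;
	size_t s = *size;
	size_t pageoff = (size_t)(a & SKETCH_PAGE_MASK);

	if (s > SIZE_MAX - pageoff)
		return -1;		/* size + pageoff would wrap */
	a -= pageoff;
	s += pageoff;

	if (s > SIZE_MAX - SKETCH_PAGE_MASK)
		return -1;		/* rounding up would wrap to 0 */
	s = (s + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;

	if (s > UINTPTR_MAX - a)
		return -1;		/* addr + size would wrap */

	*addr = a;
	*size = s;
	return 0;
}
```

Keeping the checks in one place means mmap, munmap, msync and friends all
reject the same inputs, which is the stated goal of the change.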


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@
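The idea in r1.49 of rejecting undefined protection bits, rather than silently ignoring them, can be sketched as follows. The names SKETCH_PROT_MASK and sketch_check_prot() are hypothetical, and the choice of EINVAL as the error is an assumption; the log entry does not say which errno is returned.

```c
#include <errno.h>
#include <sys/mman.h>

/* Hypothetical mask of the protection bits this sketch accepts. */
#define SKETCH_PROT_MASK (PROT_READ | PROT_WRITE | PROT_EXEC)

/*
 * Return 0 if prot contains only known protection bits, or EINVAL
 * (an assumed error code) if any other bit is set.
 */
static int
sketch_check_prot(int prot)
{
	if ((prot & ~SKETCH_PROT_MASK) != 0)
		return EINVAL;
	return 0;
}
```

PROT_NONE is 0 and therefore passes the check; any stray bit outside the mask causes the whole request to fail.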


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.160 29-Nov-2019 deraadt

Repurpose the "syscalls must be on a writeable page" mechanism to
enforce a new policy: system calls must be in pre-registered regions.
We have discussed more strict checks than this, but none satisfy the
cost/benefit based upon our understanding of attack methods, anyways
let's see what the next iteration looks like.

This is intended to harden (translation: attackers must put extra
effort into attacking) against a mixture of W^X failures and JIT bugs
which allow syscall misinterpretation, especially in environments with
polymorphic-instruction/variable-sized instructions. It fits in a bit
with libc/libcrypto/ld.so random relink on boot and no-restart-at-crash
behaviour, particularly for remote problems. Less effective once on-host
since someone the libraries can be read.

For static-executables the kernel registers the main program's
PIE-mapped exec section valid, as well as the randomly-placed sigtramp
page. For dynamic executables ELF ld.so's exec segment is also
labelled valid; ld.so then has enough information to register libc's
exec section as valid via call-once msyscall(2)

For dynamic binaries, we continue to permit the main program exec
segment because "go" (and potentially a few other applications) have
embedded system calls in the main program. Hopefully at least go gets
fixed soon.

We declare the concept of embedded syscalls a bad idea for numerous
reasons, as we notice the ecosystem has many of
static-syscall-in-base-binary which are dynamically linked against
libraries which in turn use libc, which contains another set of
syscall stubs. We've been concerned about adding even one additional
syscall entry point... but go's approach tends to double the entry-point
attack surface.

This was started at a nano-hackathon in Bob Beck's basement 2 weeks
ago during a long discussion with mortimer trying to hide from the SSL
scream-conversations, and finished in more comfortable circumstances
next to a wood-stove at Elk Lakes cabin with UVM scream-conversations.

ok guenther kettenis mortimer, lots of feedback from others
conversations about go with jsing tb sthen


# 1.159 28-Nov-2019 mlarkin

Remove end of line whitespace.

No code change.


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which means it is even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
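The W^X policy described in r1.138 and r1.139 can be summarised as a small decision function. This is a sketch of the policy as worded in those two log entries, not the kernel's uvm_wxcheck(); the enum, the function name and the boolean parameters are hypothetical.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* Hypothetical outcomes of the sketch. */
enum sketch_wx_action {
	SKETCH_WX_OK,		/* mapping is allowed */
	SKETCH_WX_ENOTSUP,	/* mmap/mprotect fails with ENOTSUP */
	SKETCH_WX_ABORT		/* process is aborted (kern.wxabort set) */
};

/*
 * wxneeded:  binary is marked "wxneeded"
 * wxallowed: binary lives on a filesystem mounted "wxallowed"
 * wxabort:   sysctl kern.wxabort is set
 */
static enum sketch_wx_action
sketch_wxcheck(int prot, bool wxneeded, bool wxallowed, bool wxabort)
{
	/* Only a simultaneously writable and executable mapping violates W^X. */
	if ((prot & (PROT_WRITE | PROT_EXEC)) != (PROT_WRITE | PROT_EXEC))
		return SKETCH_WX_OK;
	/* Violations are permitted only for wxneeded binaries on wxallowed mounts. */
	if (wxneeded && wxallowed)
		return SKETCH_WX_OK;
	/* Otherwise report ENOTSUP, or abort if the sysctl knob asks for it. */
	return wxabort ? SKETCH_WX_ABORT : SKETCH_WX_ENOTSUP;
}
```

With this shape, a buggy program normally just sees ENOTSUP, and setting the knob turns the same violation into an abort that can be inspected through a coredump.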


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
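The kind of unsigned underflow described in r1.131 can be illustrated with a small sketch. The exact expression in the kernel is not shown in the log, so the "buggy" form below is an assumed shape of the problem, and both function names are hypothetical; size_t stands in for vsize_t.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Underflow-prone form: when used > limit, (limit - used) wraps to a
 * huge unsigned value, so almost any request appears to fit.
 */
static bool
sketch_fits_buggy(size_t limit, size_t used, size_t request)
{
	return request <= limit - used;
}

/*
 * Checked form: refuse outright when usage already exceeds the limit,
 * and only then compare against the remaining headroom.
 */
static bool
sketch_fits_checked(size_t limit, size_t used, size_t request)
{
	if (used > limit)
		return false;
	return request <= limit - used;
}
```

With a limit of 100 and 200 already in use, the first form accepts a request of 50 while the second rejects it, which matches the log's remark that the check was ineffective once a process already held more memory than its limit allowed.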


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
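
A minimal sketch of the kind of shared wrap check described here. It is
written as a function rather than a macro for readability; the names
ex_align_range, EX_PAGE_SIZE and EX_PAGE_MASK and the fixed 4 KiB page size
are assumptions for illustration, not the code from this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size, for illustration only. */
#define EX_PAGE_SIZE	((uintptr_t)4096)
#define EX_PAGE_MASK	(EX_PAGE_SIZE - 1)

/*
 * Hypothetical helper: page-align the range [*addr, *addr + *size) in
 * place. Returns 0 on success, or -1 if rounding the size up or adding
 * it to the address would wrap around.
 */
int
ex_align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t a, pageoff;
	size_t s;

	assert(addr != NULL && size != NULL);
	a = *addr;
	s = *size;

	/* Move the start down to a page boundary. */
	pageoff = a & EX_PAGE_MASK;
	a -= pageoff;

	/* Rounding the size up must not wrap. */
	if (s > SIZE_MAX - pageoff - EX_PAGE_MASK)
		return -1;
	s = (s + pageoff + EX_PAGE_MASK) & ~EX_PAGE_MASK;

	/* The end of the range must not wrap past the top of memory. */
	if (a + s < a)
		return -1;

	*addr = a;
	*size = s;
	return 0;
}
```

With one helper like this, each caller can reject a wrapping request the
same way before touching the map.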


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred-only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.158 27-Nov-2019 deraadt

Add dummy msyscall(2) system call which is currently a noop. This will
be used by kernel and ld.so in the near future. Adding the system call
earlier will reduce the number of people who try to build through and
encounter agony.
ok kettenis guenther


Revision tags: OPENBSD_6_6_BASE
# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which means it is even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
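
The underflow can be illustrated with plain unsigned arithmetic. This is a
sketch only: the function names are hypothetical and size_t stands in for
vsize_t.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Naive check: when used > limit, "limit - used" wraps to a huge
 * unsigned value, so every request appears to fit.
 */
int
ex_exceeds_limit_naive(size_t limit, size_t used, size_t size)
{
	return size > limit - used;
}

/* Guarded check: refuse outright when usage is already over the limit. */
int
ex_exceeds_limit(size_t limit, size_t used, size_t size)
{
	if (used > limit)
		return 1;
	return size > limit - used;
}

/* Shows the difference for a process already over its limit. */
void
ex_demo(void)
{
	assert(ex_exceeds_limit_naive(100, 200, 1) == 0);	/* wrongly allowed */
	assert(ex_exceeds_limit(100, 200, 1) == 1);		/* correctly refused */
}
```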


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
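
A minimal sketch of the kind of check described, under stated assumptions:
the page size is fixed at 4096 bytes and the names sketch_round_page and
sketch_check_size are hypothetical, not the kernel's. Rounding a size close
to SIZE_MAX up to a page multiple wraps to 0, which is reported as ENOMEM,
while a size with only the high bit set is accepted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)	/* assumed page size */

/* Round size up to a page multiple; wraps to 0 for sizes near SIZE_MAX. */
static size_t
sketch_round_page(size_t size)
{
	return (size + (SKETCH_PAGE_SIZE - 1)) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Store the page-rounded size in *rounded and return 0, or return ENOMEM
 * if a non-zero size wrapped to 0 while being rounded.
 */
int
sketch_check_size(size_t size, size_t *rounded)
{
	size_t r = sketch_round_page(size);

	if (size != 0 && r == 0)
		return ENOMEM;
	*rounded = r;
	return 0;
}
```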


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in it. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.157 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
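
The underflow can be illustrated with a small sketch. The function names
naive_within_limit and safe_within_limit are hypothetical and this is not
the kernel's code; size_t stands in for vsize_t. With unsigned arithmetic,
limit - used wraps to a huge value once used exceeds limit, so the naive
comparison accepts almost any request.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: when used > limit, limit - used wraps around to a very
 * large unsigned value and the test passes for nearly any size.
 */
bool
naive_within_limit(size_t used, size_t size, size_t limit)
{
	return size <= limit - used;
}

/* Safe check: reject first if the process is already over its limit. */
bool
safe_within_limit(size_t used, size_t size, size_t limit)
{
	if (used > limit)
		return false;
	return size <= limit - used;
}
```

With used = 200, size = 50 and limit = 100, the naive version accepts the
request while the safe version rejects it.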


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make calling pledge_fail the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
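
The kind of shared wrap check this entry describes can be sketched roughly as follows. This is an illustration only, not the macro from the commit: the names sk_align_range, SK_PAGE_SIZE and SK_PAGE_MASK, the 4096-byte page size, and the choice of error codes are all assumptions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE	((uintptr_t)4096)	/* assumed page size */
#define SK_PAGE_MASK	(SK_PAGE_SIZE - 1)

/*
 * Page-align the range [*addr, *addr + *size) in place.
 * Returns 0 on success, EINVAL if the size wraps when rounded up,
 * or ENOMEM if the end of the range wraps around the address space.
 * (The error codes are chosen for illustration.)
 */
int
sk_align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t pageoff = *addr & SK_PAGE_MASK;
	uintptr_t start = *addr - pageoff;
	size_t len = *size;

	/* adding the page offset and rounding up must not overflow */
	if (len > SIZE_MAX - pageoff - SK_PAGE_MASK)
		return EINVAL;
	len = (len + pageoff + SK_PAGE_MASK) & ~(size_t)SK_PAGE_MASK;

	/* start + len must not wrap past the top of the address space */
	if (start + len < start)
		return ENOMEM;

	*addr = start;
	*size = len;
	return 0;
}
```

Keeping the checks in one helper means every caller rejects the same inputs, which is the consistency the entry is after.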


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@
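
Rejecting undefined bits instead of ignoring them can be sketched like this. The helper sk_check_prot(), the mask name SK_PROT_MASK and the EINVAL return value are assumptions for illustration; the log entry does not say which error the kernel returns.

```c
#include <errno.h>
#include <sys/mman.h>

/* Assumed mask of the defined protection bits (hypothetical name). */
#define SK_PROT_MASK	(PROT_READ | PROT_WRITE | PROT_EXEC)

/*
 * Return 0 if prot contains only defined protection bits,
 * EINVAL (an assumed choice) if any other bit is set.
 */
int
sk_check_prot(int prot)
{
	if ((prot & ~SK_PROT_MASK) != 0)
		return EINVAL;
	return 0;
}
```

The same pattern would apply to the flags argument with a mask of the supported MAP_* bits.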


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.
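
A sketch of cast-free rounding macros, to show why dropping the cast matters for wider types. The names sk_trunc_page, sk_round_page and the 4096-byte page size are assumptions for illustration, not the kernel's definitions.

```c
#include <stdint.h>

#define SK_PAGE_SIZE	4096		/* assumed page size */
#define SK_PAGE_MASK	(SK_PAGE_SIZE - 1)

/*
 * No cast on x: the result keeps the (promoted) type of the argument,
 * so a 64-bit offset is not truncated on a 32-bit machine.
 * SK_PAGE_MASK is a signed int on purpose: ~SK_PAGE_MASK is negative
 * and sign-extends to all-ones in the high bits of a wider operand.
 * Pointers must be cast to an integer type by the caller.
 */
#define sk_trunc_page(x)	((x) & ~SK_PAGE_MASK)
#define sk_round_page(x)	(((x) + SK_PAGE_MASK) & ~SK_PAGE_MASK)

/* Example use on a type that may be wider than an address. */
uint64_t
sk_round_offset(uint64_t off)
{
	return sk_round_page(off);
}
```

With an internal cast to an address-sized type, the same call on a 64-bit file offset would lose its upper bits on 32-bit platforms.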


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.156 11-May-2019 deraadt

move the noise about W^X mapping failure inside the sysctl kern.wxabort
knob, since we found a program which tests RWX mapping then changes execution
behaviour to non-W^X.
(that program is chrome, as v8 is heading towards W^X compliance with
mprotect RW/RX swaps, and also has jitless components in development.)
ok sthen kettenis robert


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
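
The check described above can be pictured with a toy model: given a table of mapped regions, decide whether a stack pointer value lands in one that was mapped with MAP_STACK. The struct sk_region and the function sk_sp_in_stack() are hypothetical and much simpler than the kernel's map lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy description of one mapped region (hypothetical). */
struct sk_region {
	uintptr_t start;	/* inclusive */
	uintptr_t end;		/* exclusive */
	int is_stack;		/* nonzero if mapped with MAP_STACK */
};

/*
 * Return 1 if sp lies inside a region marked as stack, 0 otherwise.
 * In the scheme described above, a 0 result at fault or syscall
 * time would lead to SIGSEGV.
 */
int
sk_sp_in_stack(const struct sk_region *map, size_t n, uintptr_t sp)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (sp >= map[i].start && sp < map[i].end)
			return map[i].is_stack != 0;
	}
	return 0;	/* unmapped address */
}
```

A stack pivot that points the stack register at ordinary heap or data memory fails this test, which is what makes such a pivot fragile.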


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
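
The underflow can be reproduced in isolation. In this sketch the function names and the use of size_t in place of vsize_t are assumptions; it shows the general shape of the bug and one way to guard against it, not the committed code.

```c
#include <stddef.h>

/*
 * Naive check: when used > limit, the unsigned subtraction wraps to a
 * huge value and the comparison never fires.
 */
int
sk_exceeds_limit_naive(size_t limit, size_t used, size_t size)
{
	return size > limit - used;
}

/*
 * Guarded check: test for "already over the limit" before subtracting.
 * Returns 1 if growing by size would exceed limit, 0 otherwise.
 */
int
sk_exceeds_limit(size_t limit, size_t used, size_t size)
{
	if (used > limit)
		return 1;
	return size > limit - used;
}
```

With limit 100 and used 150, the naive version accepts a request of 10 while the guarded version rejects it.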


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend that most users use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano
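
As a rough illustration of what counts as a W^X violation here, the sketch below tests whether a protection request asks for write and execute at the same time. The helper name `prot_violates_wx` is hypothetical; it is not the kernel's check, only the bit test that such a check would plausibly rest on.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* True when a request asks for a mapping that is writable and executable at once. */
static bool
prot_violates_wx(int prot)
{
	const int wx = PROT_WRITE | PROT_EXEC;

	return (prot & wx) == wx;
}
```

Per the entry above, a request for which this predicate holds makes mmap/mprotect return ENOTSUP unless the binary lives on a wxallowed filesystem.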


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make calling pledge_fail the responsibility of the _check function; remove it
from the callers.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program cannot use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
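
The kind of wrap check being shared here can be sketched as follows. The commit factors the checks into a macro; this sketch uses a function with a hypothetical name, and is not the kernel's actual macro.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True when [addr, addr + size) would wrap past the top of the address
 * space. Written as a comparison against the remaining headroom so the
 * check itself cannot overflow.
 */
static bool
range_wraps(uintptr_t addr, size_t size)
{
	return size > UINTPTR_MAX - addr;
}
```

A caller in the style of mmap, munmap or msync would reject the request when this predicate is true, before doing any further arithmetic on the range.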


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred-only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


Revision tags: OPENBSD_6_5_BASE
# 1.155 02-Apr-2019 deraadt

BOGO_PC is an invalid userland address, which indicates kbind() is now
disabled in the process. Rather than tying it to KERNBASE, make it simply
-1, which makes it even more invalid.
ok tedu


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
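
The underflow described above can be sketched in a few lines of C. This is
only an illustration, not the code in uvm_mmap.c: the function names and
parameters are hypothetical, and size_t stands in for the kernel's unsigned
vsize_t.

```c
#include <stddef.h>

/*
 * Illustrative only: hypothetical helpers, not the code in uvm_mmap.c.
 * size_t stands in for the kernel's unsigned vsize_t.
 */

/* Buggy form: if used > limit, (limit - used) wraps to a huge value. */
static int
data_limit_ok_buggy(size_t limit, size_t used, size_t size)
{
	return size <= limit - used;
}

/* Safe form: reject first when usage already exceeds the limit. */
static int
data_limit_ok(size_t limit, size_t used, size_t size)
{
	if (used > limit)
		return 0;
	return size <= limit - used;
}
```

With a limit of 100 and 200 already in use, the buggy form accepts a request
for 50 more, which is the "ineffective check" the log message refers to; the
safe form rejects it.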


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano
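
The policy described above can be sketched as a small predicate. This is a
simplified illustration under stated assumptions, not uvm_wxcheck() itself:
wx_check() and its wxallowed parameter are hypothetical, the kernel log
message is left out, and the kern.wxabort (SIGABRT) path is not modelled.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Illustrative only: hypothetical helper, not uvm_wxcheck() itself.
 * wxallowed stands in for "the mapping comes from a filesystem mounted
 * with the wxallowed option".
 */
static int
wx_check(int prot, int wxallowed)
{
	const int wx = PROT_WRITE | PROT_EXEC;

	if ((prot & wx) != wx)
		return 0;	/* not a write+exec request */
	if (wxallowed)
		return 0;	/* explicitly permitted */
	return ENOTSUP;		/* reported to the caller */
}
```

A request for PROT_READ | PROT_WRITE passes, while PROT_WRITE | PROT_EXEC
yields ENOTSUP unless wxallowed is nonzero.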


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
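
The wrap-to-0 case mentioned above can be sketched as follows. This is an
illustration under stated assumptions, not the code in uvm_mmap(): the helper
name round_len_checked() is hypothetical, and the page size is assumed to be
a power of two.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: hypothetical helper, not the code in uvm_mmap.c.
 * pagesz is assumed to be a power of two.
 */
static int
round_len_checked(size_t len, size_t pagesz, size_t *out)
{
	size_t rounded = (len + pagesz - 1) & ~(pagesz - 1);

	/* Unsigned overflow in the addition makes rounded < len (e.g. 0). */
	if (rounded < len)
		return ENOMEM;
	*out = rounded;
	return 0;
}
```

A length with the high bit set is still accepted as long as rounding does not
wrap; only lengths within a page of SIZE_MAX are refused.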


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.154 01-Mar-2019 cheloha

New mmap(2) flag: MAP_CONCEAL.

MAP_CONCEAL'd memory is not written to disk in the event of a core dump.
It may grow other qualities in the future.

Wanted by libressl, probably useful elsewhere, too.

Prompted by deraadt@, concept from deraadt@/kettenis@. With input from
deraadt@, cjeker@, kettenis@, otto@, bcook@, matthew@, guenther@, djm@,
and tedu@.

ok otto@ deraadt@


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

branches: 1.151.2;
Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
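A program that manages its own stacks (for coroutines, say) would have to request MAP_STACK itself. The sketch below shows one way that might look; stack_alloc() is a hypothetical helper, and defining MAP_STACK as 0 when it is missing is an assumption for portability, not something the commit describes.

```c
#define _DEFAULT_SOURCE	/* glibc only: expose MAP_ANON under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_STACK
#define MAP_STACK 0	/* assumption: no-op on systems without the flag */
#endif

/*
 * Hypothetical helper: map `len` bytes of zeroed memory that may be used
 * as a stack. The flag can only be set at mmap() time, so the region has
 * to be created this way rather than converted afterwards.
 */
void *
stack_alloc(size_t len)
{
	void *base;

	assert(len > 0);
	base = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON | MAP_STACK, -1, 0);
	return base == MAP_FAILED ? NULL : base;
}
```

On architectures where the stack grows down, the initial stack pointer would be base + len, not base.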


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

branches: 1.147.2;
Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
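From the application side, the refusal shows up as an ordinary mmap(2) failure. A minimal probe, assuming only the ENOTSUP behaviour stated above, might look like this; wx_probe() and the enum are hypothetical names.

```c
#define _DEFAULT_SOURCE	/* glibc only: expose MAP_ANON under strict -std modes */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

enum wx_result { WX_PERMITTED, WX_REFUSED, WX_ERROR };

/* Hypothetical probe: ask for a writable and executable anonymous page. */
enum wx_result
wx_probe(void)
{
	size_t len = 4096;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
	    MAP_PRIVATE | MAP_ANON, -1, 0);

	if (p == MAP_FAILED) {
		/* ENOTSUP is the W^X refusal; anything else is unrelated. */
		return errno == ENOTSUP ? WX_REFUSED : WX_ERROR;
	}
	munmap(p, len);
	return WX_PERMITTED;
}
```

A program that gets WX_REFUSED would normally fall back to mapping the page writable first and switching it to executable with mprotect(2) afterwards, so the two permissions are never held at once.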


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
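The class of bug is easy to reproduce outside the kernel. The following is an illustrative sketch only, not the kernel code: size_t stands in for vsize_t and both function names are made up.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: if `used` already exceeds `limit`, the unsigned
 * subtraction wraps to a huge value and every request passes.
 */
bool
data_limit_naive(size_t used, size_t size, size_t limit)
{
	return size <= limit - used;
}

/* Guarded check: reject first when the limit is already exceeded. */
bool
data_limit_ok(size_t used, size_t size, size_t limit)
{
	if (used > limit)
		return false;
	return size <= limit - used;
}
```

With used = 30 and limit = 20, the naive version accepts a 1-byte request while the guarded one rejects it.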


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend that most users use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make calling pledge_fail the responsibility of the _check function; remove it
from the caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program cannot use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
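A small userland check of that rule, written as a sketch: zero_length_mmap_rejected() is a hypothetical helper name, and the only behaviour it relies on is the EINVAL result described above.

```c
#define _DEFAULT_SOURCE	/* glibc only: expose MAP_ANON under strict -std modes */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Returns true if a zero-length anonymous mmap fails with EINVAL. */
bool
zero_length_mmap_rejected(void)
{
	void *p;

	errno = 0;
	p = mmap(NULL, 0, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p != MAP_FAILED)
		return false;	/* the system accepted a 0-byte mapping */
	return errno == EINVAL;
}
```

Callers that compute a length at run time should test for zero themselves, since the error is otherwise indistinguishable from other invalid-argument failures.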


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
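The kind of check being shared can be sketched as follows. This is not the kernel macro: align_range() is a hypothetical function, and the 4096-byte page size is an assumption chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: widen [*addr, *addr + *size) to page boundaries.
 * Returns 0 on success, or EINVAL if the adjusted size would wrap.
 * The inputs are left untouched on failure.
 */
int
align_range(uintptr_t *addr, size_t *size)
{
	size_t pageoff, sz;

	assert(addr != NULL && size != NULL);
	pageoff = (size_t)(*addr & SKETCH_PAGE_MASK);
	sz = *size;

	if (sz > SIZE_MAX - pageoff)
		return EINVAL;		/* size + pageoff would wrap */
	sz += pageoff;
	if (sz > SIZE_MAX - SKETCH_PAGE_MASK)
		return EINVAL;		/* rounding up would wrap */
	sz = (sz + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK;

	*addr -= pageoff;
	*size = sz;
	return 0;
}
```

Doing both comparisons before any addition is the point: once the sum has wrapped, the result looks like a small, valid size.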


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).
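A caller that no longer needs the contents of a range could use the two hints roughly as below. This is a sketch: discard_pages() is a hypothetical helper, and preferring MADV_FREE with a fallback to MADV_DONTNEED is a choice made for the example, not something the commit prescribes.

```c
#define _DEFAULT_SOURCE	/* glibc only: expose madvise() and MAP_ANON */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: tell the kernel that [p, p + len) is disposable.
 * After a successful call the caller must treat the contents as
 * undefined, because what survives differs between the two hints and
 * between operating systems.
 */
int
discard_pages(void *p, size_t len)
{
	assert(p != NULL);
#ifdef MADV_FREE
	if (madvise(p, len, MADV_FREE) == 0)
		return 0;
#endif
	return madvise(p, len, MADV_DONTNEED);
}
```

The range should be page aligned; an allocator would typically call this on whole free pages it wants to keep mapped.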


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in it. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.153 11-Jan-2019 deraadt

mincore() is a relic from the past, exposing physical machine information
about shared resources which no program should see. only a few pieces of
software use it, generally poorly thought out. they are being fixed, so
mincore() can be deleted.
ok guenther tedu jca sthen, others


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis
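
From userland, the check described above means that memory used as a stack should come from mmap() with MAP_STACK. A minimal sketch, assuming a hypothetical helper install_altstack() and an arbitrary 64 KiB size; the entry says sigaltstack() already handles this itself, so the explicit flag here is only for illustration:

```c
#define _DEFAULT_SOURCE		/* for MAP_ANON and MAP_STACK on glibc */
#include <sys/mman.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>

#ifndef MAP_STACK
#define MAP_STACK 0	/* assumption: platforms without the flag ignore it */
#endif

#define ALTSTACK_SIZE	(64 * 1024)	/* arbitrary size chosen for this sketch */

/*
 * Hypothetical helper: allocate a signal stack with mmap(MAP_STACK) and
 * install it with sigaltstack().  Returns 0 or an errno value.
 */
int
install_altstack(void)
{
	stack_t ss;
	void *p;

	p = mmap(NULL, ALTSTACK_SIZE, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON | MAP_STACK, -1, 0);
	if (p == MAP_FAILED)
		return errno;

	ss.ss_sp = p;
	ss.ss_size = ALTSTACK_SIZE;
	ss.ss_flags = 0;
	if (sigaltstack(&ss, NULL) == -1) {
		int error = errno;

		munmap(p, ALTSTACK_SIZE);
		return error;
	}
	return 0;
}
```

Because MAP_STACK can only be set by mmap(), the region has to be mapped with the flag from the start; it cannot be added to existing memory afterwards.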


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy
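
A program can stay within this rule by never asking for PROT_WRITE and PROT_EXEC at the same time. A minimal sketch of that pattern, assuming a hypothetical helper load_code_wx_safe() that is not taken from the kernel or from any port:

```c
#define _DEFAULT_SOURCE		/* for MAP_ANON on glibc */
#include <sys/mman.h>
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy `len` bytes of `code` into fresh memory that is
 * writable first and executable later, never both at once.
 * On success stores the mapping in *out and returns 0; otherwise returns
 * an errno value.
 */
int
load_code_wx_safe(const void *code, size_t len, void **out)
{
	void *p;
	int error;

	/* Step 1: writable, not executable. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p == MAP_FAILED)
		return errno;
	memcpy(p, code, len);

	/* Step 2: replace PROT_WRITE with PROT_EXEC in a single call. */
	if (mprotect(p, len, PROT_READ | PROT_EXEC) == -1) {
		error = errno;
		munmap(p, len);
		return error;
	}
	*out = p;
	return 0;
}
```

A request for PROT_WRITE | PROT_EXEC in either call is the kind of request the entry above says is answered with ENOTSUP.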


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
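
The underflow is easy to reproduce with plain unsigned arithmetic. A minimal sketch, using hypothetical function names and size_t in place of vsize_t; it shows the class of bug, not the kernel's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: when used > limit, the unsigned subtraction wraps to a huge
 * value and the request is never rejected.
 */
bool
over_limit_naive(size_t limit, size_t used, size_t request)
{
	return request > limit - used;
}

/* Guarded check: test for used > limit before subtracting. */
bool
over_limit_safe(size_t limit, size_t used, size_t request)
{
	return used > limit || request > limit - used;
}
```

With a limit of 100 and 150 already in use, the naive version accepts a request of 10 while the guarded version rejects it.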


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make calling pledge_fail the responsibility of the _check function. remove it
from the caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program cannot use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX requires that a 0-byte mmap return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory
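
The behaviour is visible from userland. A minimal sketch, assuming a hypothetical helper try_anon_mmap() that reports the errno from mmap():

```c
#define _DEFAULT_SOURCE		/* for MAP_ANON on glibc */
#include <sys/mman.h>
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Attempt an anonymous mapping of `len` bytes and release it again.
 * Returns 0 on success or the errno reported by mmap().
 */
int
try_anon_mmap(size_t len)
{
	void *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p == MAP_FAILED)
		return errno;
	munmap(p, len);
	return 0;
}
```

Calling try_anon_mmap(0) is expected to return EINVAL, while a non-zero length such as 4096 should normally succeed.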


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
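
The kind of check being factored out can be sketched as a small function. This is an illustration only: the name align_range(), the fixed 4096-byte page size and the EINVAL return are assumptions, and the real code is a macro with its own details:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	((uintptr_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK	(SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: widen [*addr, *addr + *size) to page boundaries.
 * Returns 0 on success, or EINVAL if any step would wrap around.
 */
int
align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t a = *addr;
	size_t s = *size;
	size_t pageoff = a & SKETCH_PAGE_MASK;

	a -= pageoff;				/* truncate start to a page */
	if (s > SIZE_MAX - pageoff)
		return EINVAL;			/* s + pageoff would wrap */
	s += pageoff;
	if (s > SIZE_MAX - SKETCH_PAGE_MASK)
		return EINVAL;			/* rounding up would wrap */
	s = (s + SKETCH_PAGE_MASK) & ~(size_t)SKETCH_PAGE_MASK;
	if (a > UINTPTR_MAX - s)
		return EINVAL;			/* a + s would wrap */

	*addr = a;
	*size = s;
	return 0;
}
```

Sharing one routine across mmap, munmap, msync and the rest means every caller rejects the same wrapping inputs in the same way.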


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in it. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.152 10-Jan-2019 tedu

Make mincore lie. The nature of shared memory means it can spy on what
another process is doing. We don't want that, so instead have it
always return that memory is in core.
ok deraadt kettenis


Revision tags: OPENBSD_6_4_BASE
# 1.151 15-Aug-2018 kettenis

Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
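
The underflow described in this entry can be sketched in a few lines of C.
This is an illustration only, not the code from uvm_mmap.c: the function
names data_limit_ok_naive() and data_limit_ok() are hypothetical, and
size_t stands in for the kernel's vsize_t.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: when used > limit, the unsigned subtraction wraps to a
 * huge value, so the request is wrongly accepted.
 */
bool
data_limit_ok_naive(size_t limit, size_t used, size_t request)
{
	return request <= limit - used;
}

/* Safer check: reject first if the limit is already exceeded. */
bool
data_limit_ok(size_t limit, size_t used, size_t request)
{
	if (used > limit)
		return false;
	return request <= limit - used;
}
```

With limit = 100 and used = 200, the naive version accepts a request of 50
bytes, while the second version rejects it.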


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from the caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed program doesn't have the ability to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
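
A rough sketch of the kind of wrap check this entry refers to. It is not
the macro from uvm_mmap.c: the commit uses a macro, while this example
uses a function for readability, and the name aligned_range_wraps() and
the 4096-byte SKETCH_PAGE_SIZE are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uintptr_t)4096)	/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/* Return true if page-aligning [addr, addr + size) outward would wrap. */
bool
aligned_range_wraps(uintptr_t addr, size_t size)
{
	size_t off = (size_t)(addr & SKETCH_PAGE_MASK);	/* offset into page */
	uintptr_t start = addr - off;			/* round start down */
	size_t len = size + off;

	if (len < size)
		return true;		/* size + off wrapped */
	if (len > SIZE_MAX - SKETCH_PAGE_MASK)
		return true;		/* rounding up would wrap */
	len = (len + SKETCH_PAGE_MASK) & ~(size_t)SKETCH_PAGE_MASK;
	return start + len < start;	/* end address wrapped */
}
```

Sharing one such check between mmap, munmap, msync and similar calls keeps
the rejection rules identical for all of them, which is the point of the
change above. Note that this sketch is conservative: it also rejects a
range that ends exactly at the top of the address space.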


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@
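
The idea of rejecting undefined bits can be sketched as follows. This is
an illustration, not the kernel code: the function name check_prot() is
hypothetical, and returning EINVAL for unknown bits is an assumption
about the error value.

```c
#include <errno.h>
#include <sys/mman.h>

/* Return EINVAL if prot carries bits other than read/write/exec, else 0. */
int
check_prot(int prot)
{
	const int allowed = PROT_READ | PROT_WRITE | PROT_EXEC;

	if (prot & ~allowed)
		return EINVAL;
	return 0;
}
```

A caller that passes a stray bit now gets an error back instead of having
the bit silently dropped.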


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks of intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Also contains support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.151 15-Aug-2018 kettenis

Push back the kernel lock in sys_mmap(2) a little bit more now that
fd_getfile(9) is mpsafe. Note that sys_mmap(2) isn't actually unlocked
currently. However this diff has been tested with it unlocked, and I
hope to unlock it for real soon-ish.

ok visa@, mpi@


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are that W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
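
The underflow described above can be illustrated with a small sketch. This is
not the kernel's code: the function names and the use of size_t in place of
vsize_t are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helpers, not the uvm_mmap.c code. Both answer the question
 * "would mapping `size` more bytes exceed `limit`, given `used` bytes are
 * already accounted?"
 */
static bool
exceeds_limit_naive(size_t used, size_t size, size_t limit)
{
	/*
	 * BUG: when used > limit, the unsigned subtraction wraps around to
	 * a huge value, so the comparison is false and the check passes.
	 */
	return size > limit - used;
}

static bool
exceeds_limit_checked(size_t used, size_t size, size_t limit)
{
	/* Test used > limit first so the subtraction cannot wrap. */
	return used > limit || size > limit - used;
}
```

With used = 200 and limit = 100, the naive version lets a 10-byte request
through, while the checked version rejects it.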


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend that most users use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- POSIX rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine-dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@
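
The wrap check described above can be sketched as follows. This is an
illustration, not the uvm_mmap() code: the helper name, the fixed 4096-byte
page size, and the use of size_t are assumptions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)	/* assumed page size */

/*
 * Hypothetical helper: round `size` up to a page boundary. If the rounding
 * wraps past SIZE_MAX (the result ends up smaller than the input), return
 * ENOMEM. Sizes with the high bit set are ordinary unsigned values and are
 * accepted as long as rounding does not wrap.
 */
static int
round_size_to_page(size_t size, size_t *out)
{
	size_t rounded;

	rounded = (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
	if (rounded < size)
		return ENOMEM;
	*out = rounded;
	return 0;
}
```

A request of SIZE_MAX bytes wraps to 0 after rounding and is rejected, while
an already page-aligned size with the top bit set passes through unchanged.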


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.150 27-Apr-2018 mpi

Move FREF() inside fd_getfile().

ok visa@


# 1.149 12-Apr-2018 deraadt

Implement MAP_STACK option for mmap(). Synchronous faults (pagefault and
syscall) confirm the stack register points at MAP_STACK memory, otherwise
SIGSEGV is delivered. sigaltstack() and pthread_attr_setstack() are modified
to create a MAP_STACK sub-region which satisfies alignment requirements.
Observe that MAP_STACK can only be set/cleared by mmap(), which zeroes the
contents of the region -- there is no mprotect() equivalent operation, so
there is no MAP_STACK-adding gadget.
This opportunistic software-emulation of a stack protection bit makes
stack-pivot operations during ROPchain fragile (kind of like removing a
tool from the toolbox).
original discussion with tedu, uvm work by stefan, testing by mortimer
ok kettenis


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.148 27-Mar-2018 mpi

Make sure that programs violating a pledge(2) promise or some memory
protection cannot block the final SIGABRT.

While here apply the same logic to ddb(4)'s kill command.

From semarie@, ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.147 19-Feb-2018 mpi

Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer messes with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
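
A minimal sketch of the underflow described above, not the committed code. With unsigned arithmetic, `limit - used` wraps to a huge value once `used` already exceeds `limit`, so a check written only as `size > limit - used` never fires. The function names are hypothetical, and `size_t` stands in for `vsize_t`.

```c
#include <stddef.h>

/* Buggy form: when used > limit, limit - used wraps around to a large
 * value and the request is wrongly accepted. */
int
over_limit_buggy(size_t used, size_t size, size_t limit)
{
	return size > limit - used;
}

/* Safe form: compare used against limit first, so the subtraction
 * can never underflow. */
int
over_limit_safe(size_t used, size_t size, size_t limit)
{
	return used > limit || size > limit - used;
}
```

With `used = 200`, `size = 10`, `limit = 100`, the buggy form reports no violation while the safe form does.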


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
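
A rough sketch of what a shared wrap check for an address/size pair can look like, written as a function for readability. This is an illustration under stated assumptions and not the macro from the commit: the names `align_range`, `SKETCH_PAGE_SIZE` and `SKETCH_PAGE_MASK` are hypothetical, and a fixed 4096-byte page is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uintptr_t)4096)
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/* Round [*addr, *addr + *size) outward to page boundaries.
 * Returns 0 on success, -1 if any step would wrap around. */
int
align_range(uintptr_t *addr, size_t *size)
{
	uintptr_t a = *addr;
	size_t s = *size;
	uintptr_t pageoff = a & SKETCH_PAGE_MASK;

	a -= pageoff;			/* truncate start to a page */
	if (s > SIZE_MAX - pageoff)
		return -1;		/* size + pageoff wraps */
	s += pageoff;
	if (s > SIZE_MAX - SKETCH_PAGE_MASK)
		return -1;		/* rounding up wraps */
	s = (s + SKETCH_PAGE_MASK) & ~(size_t)SKETCH_PAGE_MASK;
	if (a > UINTPTR_MAX - s)
		return -1;		/* addr + size wraps */
	*addr = a;
	*size = s;
	return 0;
}
```

Keeping the checks in one place means every caller rejects the same wrapping inputs in the same way.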


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE to in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.147 19-Feb-2018 mpi

Remove almost unused `flags' argument of suser().

The account flag `ASU' will no longer be set but that makes suser()
mpsafe since it no longer mess with a per-process field.

No objection from millert@, ok tedu@, bluhm@


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- make the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.146 11-Feb-2018 deraadt

Can mask MAP_STACK by name rather than number


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
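
The underflow described in 1.131 can be illustrated with a small sketch. This
is not the kernel code: size_t stands in for vsize_t, and the function names
sketch_over_limit_naive and sketch_over_limit are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Naive check: "limit - used" underflows when used > limit, producing a
 * huge unsigned value, so the request is wrongly allowed.
 */
static bool
sketch_over_limit_naive(size_t limit, size_t used, size_t size)
{
	return size > limit - used;
}

/* Guarded check: compare used against limit before subtracting. */
static bool
sketch_over_limit(size_t limit, size_t used, size_t size)
{
	return used > limit || size > limit - used;
}
```

With a limit of 100 and 200 already in use, the naive form lets a further
request of 10 through, while the guarded form refuses it.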


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- makes pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophecy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very good when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advices to madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that are interrupt safe and never allow faults
in them. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled


# 1.145 15-Jan-2018 deraadt

mask out (ie. ignore) the bit which will be MAP_STACK in the future,
so diffs in snapshots can exercise the change in a less disruptive way
idea with sthen, ok kettenis tom others


# 1.144 02-Jan-2018 guenther

Stop assuming <sys/file.h> will pull in fcntl.h when _KERNEL is defined.

ok millert@ sthen@


# 1.143 30-Nov-2017 guenther

__MAP_NOFAULT doesn't make sense with anon mappings, so return EINVAL if
that is attempted.
Minor cleanups:
- Eliminate some always false and always true tests against MAP_ANON
- We treat anon mappings with neither MAP_{SHARED,PRIVATE} as MAP_PRIVATE
so explicitly indicate that

ok kettenis@ beck@


Revision tags: OPENBSD_6_1_BASE OPENBSD_6_2_BASE
# 1.142 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.141 05-Oct-2016 guenther

Display/test/use the process PID, not the thread's TID, in a few places.

ok mpi@ mikeb@


# 1.140 16-Sep-2016 dlg

move the uvm_map_addr RB tree from RB macros to the RBT functions

this tree is interesting because it uses all the red black tree
features, specifically the augment callback that's called on tree
topology changes, and it poisons and checks entries as they're removed
from and inserted back into the tree respectively.

ok stefan@


# 1.139 18-Aug-2016 deraadt

uvm_wxcheck() should only abort the process if kern.wxabort is set.
The new semantics are W^X violations are reported to the application
via ENOTSUP. Forgot to fix this during the last change.
Spotted by kettenis


# 1.138 08-Aug-2016 deraadt

W^X violations are only permitted for binaries marked "wxneeded" on
"wxallowed" filesystems. mmap(2) & mprotect(2) now return ENOTSUP.
(To diagnose buggy programs, consider using sysctl kern.wxabort=1 and
looking at the coredumps)
ok kettenis tedu naddy


Revision tags: OPENBSD_6_0_BASE
# 1.137 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAULT will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.136 13-Jul-2016 kettenis

Revert previous; the __MAP_NOFAULT test is inverted and the commit message is
wrong.


# 1.135 13-Jul-2016 kettenis

Since mappings established using __MAP_NOFAIL will be converted into anonymous
memory if the file backing the mapping is truncated, we should check resource
limits. This prevents callers from triggering a kernel panic and a potential
integer overflow in the amap code by forcing the allocation of too many slots.

Based on an analysis from Jesse Hertz and Tim Newsham.

ok deraadt@


# 1.134 08-Jun-2016 deraadt

Dereference p_p once rather than 4 times.


# 1.133 08-Jun-2016 deraadt

hppa & mips64 now can do the full W^X check. (Make sure you have
a new kernel before this change, and ld.so updated)


# 1.132 04-Jun-2016 sthen

If a process trips the W^X violation check, abort it unless it came
from a filesystem with the wxallowed flag set. ok deraadt

Current status:

Filesystem Binary Action


# 1.131 02-Jun-2016 schwarze

Prevent vsize_t underflow when checking RLIMIT_DATA, which made the
check ineffective when you already had more memory than your limit
allowed.

I noticed after writing this diff that millert@ already committed a fix
for this in rev. 1.74 (2009/06/01), but it got backed out with the giant
pmemrange backout two weeks later and was never restored.

OK tedu@ ("just fix it" and "go ahead with your version")
stefan@ also agrees that a check is needed.
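
The underflow described in rev. 1.131 can be illustrated with a minimal
sketch. The helper names and the use of size_t in place of vsize_t are
assumptions made for illustration; this is not the kernel's actual code.
The naive comparison subtracts first, so once usage already exceeds the
limit the unsigned difference wraps to a huge value and the check never
fires.

```c
#include <stddef.h>

/*
 * Hypothetical helpers, not the kernel's code: "used" is the data size
 * already charged to the process, "limit" stands in for RLIMIT_DATA.
 */

/* Naive form: limit - used wraps around when used > limit. */
static int
over_limit_naive(size_t used, size_t size, size_t limit)
{
	return size > limit - used;
}

/* Guarded form: test used > limit before subtracting. */
static int
over_limit_checked(size_t used, size_t size, size_t limit)
{
	return used > limit || size > limit - used;
}
```

With used = 200 and limit = 100, the naive form accepts a 10-byte request
while the guarded form rejects it.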


# 1.130 01-Jun-2016 guenther

Delete the kernel compat bits for old mmap() MAP_OLD* flags

ok deraadt@ matthew@ jca@


# 1.129 30-May-2016 deraadt

Identify W^X labelled binaries at execve() time based upon WX_OPENBSD_WXNEEDED
flag set by ld -zwxneeded. Such binaries are allowed to run only on wxallowed
mountpoints. They do not report mmap/mprotect problems.

Rate limit mmap/mprotect reports from other binaries.

These semantics are chosen to encourage progress in the ports ecosystem,
without overwhelming the developers who work in the area.
ok sthen kettenis


# 1.128 30-May-2016 deraadt

backout to insert correct commit message


# 1.127 30-May-2016 deraadt

*** empty log message ***


# 1.126 27-May-2016 deraadt

W^X violations are no longer permitted by default. A kernel log message
is generated, and mprotect/mmap return ENOTSUP. If the sysctl(8) flag
kern.wxabort is set then a SIGABRT occurs instead, for gdb use or coredump
creation.

W^X violating programs can be permitted on a ffs/nfs filesystem-basis,
using the "wxallowed" mount option. One day far in the future
upstream software developers will understand that W^X violations are a
tremendously risky practice and that style of programming will be
banished outright. Until then, we recommend most users need to use the
wxallowed option on their /usr/local filesystem. At least your other
filesystems don't permit such programs.

ok jca kettenis mlarkin natano
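
As a rough illustration of the policy in rev. 1.126, the sketch below
rejects a protection request that is both writable and executable unless
the caller is exempt. The function name wx_check and the wx_permitted
parameter are hypothetical stand-ins (the latter for "binary resides on a
wxallowed filesystem"); the kernel log message and the kern.wxabort
behaviour are not modelled.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Hypothetical sketch, not the kernel's uvm_wxcheck(): wx_permitted is an
 * illustrative flag meaning the request is exempt from W^X enforcement.
 */
static int
wx_check(int prot, int wx_permitted)
{
	const int wx = PROT_WRITE | PROT_EXEC;

	/* Only the combination of write and exec is a violation. */
	if ((prot & wx) == wx && !wx_permitted)
		return ENOTSUP;
	return 0;
}
```

A request for PROT_READ | PROT_WRITE or PROT_READ | PROT_EXEC passes;
only the two bits together produce ENOTSUP.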


# 1.125 11-May-2016 deraadt

remove hppa64 port, which we never got going beyond broken single users.
hppa reverse-stack gives us a valuable test case, but most developers don't
have a 2nd one to proceed further with this.
ok kettenis


# 1.124 29-Mar-2016 chl

Remove dead assignments and now unused variables.

Found by LLVM/Clang Static Analyzer.

ok mpi@ stefan@


# 1.123 09-Mar-2016 deraadt

remove vaxisms


Revision tags: OPENBSD_5_9_BASE
# 1.122 11-Nov-2015 mmcc

branches: 1.122.2;
Remove the superfluous typedef uvm_flag_t (unsigned int). Also, fix an
associated mistake in the uvm manpage.

Suggested by and ok tedu@


# 1.121 01-Nov-2015 semarie

refactor pledge_*_check and pledge_fail functions

- rename _check function without suffix: a "pledge" function called from
anywhere is a "check" function.

- makes the pledge_fail call the responsibility of the _check function. remove it
from caller.

- make proper use of (potential) returned error of _check() functions.

- adds pledge_kill() and pledge_protexec()

with and OK deraadt@


# 1.120 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.119 30-Sep-2015 semarie

implement new "prot_exec" tame(2) request:
- by default, a tamed-program doesn't have the possibility to use PROT_EXEC for
mmap(2) or mprotect(2)
- for that, use the request "prot_exec" (that could be dropped later)

initial idea from deraadt@ and kettenis@

"make complete sense" beck@
ok deraadt@


# 1.118 28-Sep-2015 tedu

the kernel lock is no longer needed in the fixed case since uvm_map
will perform the unmap as necessary, holding the vm lock.
reminded by kettenis


# 1.117 28-Sep-2015 tedu

add a flag to indicate to uvm_map that it should unmap to make space.
this pulls all the relevant operations under the same map locking, and
relieves calling code from responsibility.
ok kettenis matthew


# 1.116 26-Sep-2015 tedu

matthew noticed there's a race where we are using the kernel lock to tie
together the unmap and map portions of a fixed mmap. make this explicit
by pulling the lock up higher. in preparation for unlocking the syscall.

there's still (always has been) a race where if the unmap sleeps, another
mmap may see partial results because the map lock isn't held throughout.
another problem, another day.


# 1.115 23-Sep-2015 guenther

Correct a kbind comment to describe the version that was settled on: no old
data, only new


# 1.114 06-Sep-2015 deraadt

sizes for free(); ok semarie


# 1.113 25-Aug-2015 guenther

In sys_kbind(), pages from uvm_map_extract() must be written to with kcopy()

ok kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.112 20-Jul-2015 miod

branches: 1.112.4;
Actually return a value from sys_kbind() in the non-ld.so case, or the
compiler will warn.


# 1.111 20-Jul-2015 jsg

include sys/user.h to unbreak the build on at least arm after rev 1.110
ok miod@


# 1.110 20-Jul-2015 guenther

Add kbind, a syscall for ld.so to use to securely and efficiently update
memory for lazy binding

ok deraadt@


# 1.109 07-May-2015 mpi

Pass a thread pointer instead of its file descriptor table to getvnode(9).

Input and ok millert@


# 1.108 30-Mar-2015 miod

Extend uvm_map_hint() to get an address range as extra arguments, and make
sure it will return an address within that range.

Use this in uaddr_rnd_select() to make sure we will not attempt to pick
an address beyond what we are allowed to map.

In my trees for 9 months, blackmailed s2k15 attendees into agreeing now would
be a good time to commit.


Revision tags: OPENBSD_5_7_BASE
# 1.107 13-Feb-2015 millert

Include sys/stdint.h for SIZE_MAX instead of relying on the misplaced
define in sys/limits.h. OK guenther@


# 1.106 07-Feb-2015 tedu

recombine some of the split uvm_mmap functions. the precondition checks
are not necessary because the caller already ensures these. the tail
section for handling mlock can be shared as well.
ok beck guenther


# 1.105 06-Feb-2015 beck

-Split out uvm_mmap and uvm_map into a version for anon's and a version
for everything else.
-Adapt the anon version to be callable without the biglock held.
Done by tedu@, kettenis@ and me.. pounded on a bunch.

This does not yet make mmap a NOLOCK call, but permits it to be so.
ok tedu@, kettenis@, guenther@ jsing@


# 1.104 17-Dec-2014 guenther

Prefer MADV_* over POSIX_MADV_* in kernel for consistency: the latter
doesn't have all the values and therefore can't be used everywhere.

ok deraadt@ kettenis@


# 1.103 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


# 1.102 15-Dec-2014 guenther

Use MAP_INHERIT_* for the 'inh' argument to the UVM_MAPFLAG() macro,
eliminating the must-be-kept-in-sync UVM_INH_* macros

ok deraadt@ tedu@


# 1.101 09-Dec-2014 doug

Sprinkle in a little more mallocarray().

ok deraadt@ tedu@


# 1.100 16-Nov-2014 deraadt

Replace a plethora of historical protection options with just
PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC from mman.h.
PROT_MASK is introduced as the one true way of extracting those bits.
Remove UVM_ADV_* wrapper, using the standard names.
ok doug guenther kettenis


# 1.99 03-Oct-2014 kettenis

Introduce __MAP_NOFAULT, a mmap(2) flag that makes sure a mapping will not
cause a SIGSEGV or SIGBUS when a mapped file gets truncated. Access to
pages that are not backed by a file on such a mapping will be replaced by
zero-filled anonymous pages. Makes passing file descriptors of mapped files
usable without having to play tricks with signal handlers.

"steal your mmap flag" deraadt@


Revision tags: OPENBSD_5_6_BASE
# 1.98 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.97 08-Jul-2014 deraadt

bye bye UBC; ok beck dlg


# 1.96 02-Jul-2014 matthew

Use real parameter types for u{dv,vn}_attach() instead of void *

ok guenther


# 1.95 27-Jun-2014 matthew

Cleanup support for legacy mmap flags

Move all legacy MAP_FOO values behind #ifndef _KERNEL and redefine
them to either be aliases for existing flags (e.g., MAP_COPY ->
MAP_PRIVATE) or 0.

Also, add MAP_OLDFOO defines (behind #ifndef _KERNEL) so the kernel
and kdump can remain compatible with current OpenBSD binaries.

ok deraadt


# 1.94 13-Apr-2014 tedu

compress code by turning four line comments into one line comments.
emphatic ok usual suspects, grudging ok miod


Revision tags: OPENBSD_5_4_BASE OPENBSD_5_5_BASE
# 1.93 30-May-2013 tedu

remove lots of comments about locking per beck's request


# 1.92 30-May-2013 tedu

remove simple_locks from uvm code. ok beck deraadt


Revision tags: OPENBSD_5_2_BASE OPENBSD_5_3_BASE
# 1.91 21-Jul-2012 matthew

Add a new mmap(2) flag __MAP_NOREMAP for use with MAP_FIXED to
indicate that the kernel should fail with MAP_FAILED if the specified
address is not currently available instead of unmapping it.

Change ld.so on i386 to make use of __MAP_NOREMAP to improve
reliability.

__MAP_NOREMAP diff by guenther based on an earlier diff by Ariane;
ld.so bits by guenther and me
bulk build stress testing of earlier diffs by sthen
ok deraadt; committing now for further testing


# 1.90 22-Apr-2012 guenther

Add struct proc * argument to FRELE() and FILE_SET_MATURE() in
anticipation of further changes to closef(). No binary change.

ok krw@ miod@ deraadt@


# 1.89 10-Apr-2012 ariane

Return EINVAL on 0-byte mmap invocation.

- Posix rules that a 0-byte mmap must return EINVAL
- our allocators are unable to distinguish between free memory and
0 bytes of allocated memory


# 1.88 09-Mar-2012 ariane

New vmmap implementation.

no oks (it is really a pain to review properly)
extensively tested, I'm confident it'll be stable
'now is the time' from several icb inhabitants

Diff provides:
- ability to specify different allocators for different regions/maps
- a simpler implementation of the current allocator
- currently in compatibility mode: it will generate similar addresses
as the old allocator


Revision tags: OPENBSD_5_0_BASE OPENBSD_5_1_BASE
# 1.87 09-Jul-2011 matthew

More syscalls.master cleanup:

sys_osigaltstack() is 7 years old and no longer needed; all glory to
the sys_sigaltstack()!

sys_ogetdirentries() is about 9 months old, but still acceptable
within our release cycle; move from STD to COMPAT_48 to make this
clearer for tedu@ next year.

sys_sbrk() and sys_sstk() are completely obsolete: all they do is
return ENOSYS.

ok guenther@


# 1.86 05-Jul-2011 oga

msync has some code that is based on *old* bsd behaviour where
msync(size == 0) did strange things based on the original mapping
segments and trying to manipulate same. This code was copied from the
original vm when we moved to uvm.

posix says nothing about this behaviour and anything that depends on it is
systemically broken, so rip it out and make sys_msync about 30% smaller.

ok deraadt@, tedu@, guenther@.


# 1.85 04-Jul-2011 deraadt

move the specfs code to a place people can see it; ok guenther thib krw


# 1.84 06-Jun-2011 ariane

Backout vmmap in order to repair virtual address selection algorithms
outside the tree.


# 1.83 24-May-2011 ariane

Reimplement uvm/uvm_map.

vmmap is designed to perform address space randomized allocations,
without letting fragmentation of the address space go through the roof.

Some highlights:
- kernel address space randomization
- proper implementation of guardpages
- roughly 10% system time reduction during kernel build

Tested by a lot of people on tech@ and developers.
Theo's machines are still happy.


Revision tags: OPENBSD_4_9_BASE
# 1.82 24-Dec-2010 tedu

add a param to uvm_map_hint to not skip over the heap, and use it as a last
resort if mmap fails otherwise to enable more complete address space
utilization. tested for a while with no ill effects.


# 1.81 15-Dec-2010 tedu

add a BRKSIZ define and use it for the heap gap constant, decoupling
heap gap from max data size. nothing else changes yet. ok deraadt


Revision tags: OPENBSD_4_8_BASE
# 1.80 21-May-2010 oga

Fix a locking problem in mincore where it was possible for us to sleep
with a spinlock (even vslocked() buffers may fault in the right
(complicated) situation).

We solve this by preallocating a bounded array for the response and copying the
data out when all locks have been released.

ok thib@, beck@


Revision tags: OPENBSD_4_7_BASE
# 1.79 25-Jul-2009 miod

Add an extra argument to uvm_unmap_remove(), for the caller to tell it
whether removing holes or parts of them is allowed or not.
Only allow hole removal in uvmspace_free(), when tearing the vmspace down.

ok art@


# 1.78 22-Jul-2009 oga

Put the PG_RELEASED changes diff back in.

This has been tested very very thoroughly on all archs we have
excepting 88k and 68k. Please see cvs log for the individual commit
messages.

ok beck@, thib@


# 1.77 09-Jul-2009 thib

Remove the VREF() macro and replace all instances with a call to vref(),
which is exactly what the macro does.

Macros that are nothing more than:
#define FUNCTION(arg) function(arg)
are almost always pointless and should go away.

OK blambert@
Agreed by many.


Revision tags: OPENBSD_4_6_BASE
# 1.76 17-Jun-2009 oga

date based reversion of uvm to the 4th May.

More backouts in line with previous ones, this appears to bring us back to a
stable condition.

A machine forced to 64mb of ram cycled 10GB through swap with this diff
and is still running as I type this. Other tests by ariane@ and thib@
also seem to show that it's alright.

ok deraadt@, thib@, ariane@


# 1.75 16-Jun-2009 oga

Backout all changes to uvm after pmemrange (which will be backed out
separately).

a change at or just before the hackathon has either exposed or added a
very very nasty memory corruption bug that is giving us hell right now.
So in the interest of kernel stability these diffs are being backed out
until such a time as that corruption bug has been found and squashed,
then the ones that are proven good may slowly return.

a quick hitlist of the main commits this backs out:

mine:
uvm_objwire
the lock change in uvm_swap.c
using trees for uvm objects instead of the hash
removing the pgo_releasepg callback.

art@'s:
putting pmap_page_protect(VM_PROT_NONE) in uvm_pagedeactivate() since
all callers called that just prior anyway.

ok beck@, ariane@.

prompted by deraadt@.


# 1.74 01-Jun-2009 millert

Deal with wraparound when checking RLIMIT_DATA.
OK guenther@ otto@


# 1.73 01-Jun-2009 oga

Since we've now cleared up a lot of the PG_RELEASED setting, remove the
pgo_releasepg() hook and just free the page the "normal" way in the one
place we'll ever see PG_RELEASED and should care (uvm_page_unbusy,
called in aiodoned).

ok art@, beck@, thib@


# 1.72 20-Mar-2009 oga

While working on some stuff in uvm I've gotten REALLY sick of reading
K&R function declarations, so switch them all over to ansi-style, in
accordance with the prophesy.

"go for it" art@


Revision tags: OPENBSD_4_5_BASE
# 1.71 10-Nov-2008 deraadt

vm_map_lock() around calls to uvm_map_findspace(); ok tedu


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.70 01-Sep-2007 martin

replace the machine dependent bytes-to-clicks macro by the MI ptoa()
version for i386

more architectures and ctob() replacement is being worked on

prodded by and ok miod


Revision tags: OPENBSD_4_2_BASE
# 1.69 18-Jun-2007 pedro

Bring back Mickey's UVM anon change. Testing by thib@, beck@ and
ckuethe@ for a while. Okay beck@, "it is good timing" deraadt@.


# 1.68 31-May-2007 thib

zap the vm_amap am_l simplelock, and amap_{lock/unlock} macros for
simple_{lock/unlock}.

ok art@


# 1.67 27-Mar-2007 art

Clean up some return value handling now that we know that what's returned
is proper errnos.

millert@ ok and some help


# 1.66 26-Mar-2007 art

Rip out the KERN_ error codes.
ok otto@


# 1.65 25-Mar-2007 art

remove KERN_SUCCESS and use 0 instead.
eyeballed by miod@ and pedro@


Revision tags: OPENBSD_4_1_BASE
# 1.64 25-Feb-2007 millert

Make integer wrap checks the same for mmap, munmap, msync, etc
by factoring most of the checks into a macro. OK otto@
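
A minimal sketch of the kind of shared wrap check rev. 1.64 describes,
written as a function rather than a macro. The name align_range, the
fixed 4096-byte page size, and the EINVAL return value are all
assumptions for illustration; the real macro in uvm_mmap.c may differ in
each of these.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096u		/* assumed page size for the sketch */
#define SK_PAGE_MASK (SK_PAGE_SIZE - 1)

/*
 * Hypothetical helper: truncate *addr down to a page boundary and round
 * *size up to whole pages, refusing any request whose size would wrap.
 */
static int
align_range(uintptr_t *addr, size_t *size)
{
	size_t pageoff = *addr & SK_PAGE_MASK;
	size_t len = *size;

	/* Adding the in-page offset must not wrap. */
	if (len > SIZE_MAX - pageoff)
		return EINVAL;
	len += pageoff;

	/* Rounding up to the next page must not wrap either. */
	if (len > SIZE_MAX - SK_PAGE_MASK)
		return EINVAL;
	len = (len + SK_PAGE_MASK) & ~(size_t)SK_PAGE_MASK;

	*addr -= pageoff;
	*size = len;
	return 0;
}
```

Routing mmap, munmap, msync and similar calls through one such helper
means each of them rejects the same inputs, which is the consistency the
commit is after.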


Revision tags: OPENBSD_4_0_BASE
# 1.63 13-Jul-2006 deraadt

Back out the anon change. Apparently it was tested by a few, but most of
us did not see it or get a chance to test it before it was committed. It
broke cvs, in the ami driver, making it not succeed at seeing its devices.


# 1.62 29-Jun-2006 mickey

fallout from previous: remapping anonymous memory did not account dsize properly; found by krause and mmap_fixed


# 1.61 21-Jun-2006 mickey

from netbsd: make anons dynamically allocated from pool.
this results in less kva waste due to static preallocation of those
for every phys page and also every swap page.
tested by beck krw miod


# 1.60 06-Apr-2006 kurt

Fix a process datasize leak with MAP_FIXED. When zapping old mappings
call uvm_unmap_p instead of uvm_unmap so that it has the process information
and can adjust vm_dused. okay pedro@ tedu@


# 1.59 04-Apr-2006 miod

Revert r1.58, I was on drugs - the array we are locking is one byte per
page, so the arithmetic was ok. Spotted by david@


# 1.58 16-Mar-2006 miod

In sys_mincore(), pass a size in bytes, not pages, to uvm_vslock() and
uvm_vsunlock(). ok mickey@


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE
# 1.57 01-Jun-2005 tedu

branches: 1.57.2; 1.57.4;
use vm_dused for rlimit. much happier with mmap. tested by several
over past week. as a bonus, kills 5 XXXs.


# 1.56 24-May-2005 tedu

add a new field to vm_space and use it to track the number of anon
pages a process uses. this is now the userland "data size" value.
ok art deraadt tdeval. thanks testers.


Revision tags: OPENBSD_3_7_BASE
# 1.55 15-Jan-2005 otto

In uvm_mmap(), check for size wrap to 0, and return ENOMEM in that
case. Do not arbitrarily disallow sizes with the high bit set, they
are unsigned. With lotsa help from miod@, test by danh@
ok miod@ millert@ tedu@


Revision tags: OPENBSD_3_6_BASE SMP_SYNC_A SMP_SYNC_B
# 1.54 07-May-2004 tedu

branches: 1.54.2;
align to __LDPGSZ for anon mmap. this allows userland to be compiled
with a static page size on platforms where it may vary.
ok deraadt@ millert@ tdeval@


Revision tags: OPENBSD_3_4_BASE OPENBSD_3_5_BASE
# 1.53 02-Sep-2003 tedu

branches: 1.53.4;
add a random offset to uvm_map_hint. this has the primary effect of
scattering ld.so and libraries around, although all mmaps will also
have some jitter too. better version after some discussion with drahn
testing/ok deraadt henning marcm otto pb


# 1.52 01-Sep-2003 henning

match syscallargs comments with reality
from Patrick Latifi <patrick.l@hermes.usherb.ca>
ok jason@ tedu@


# 1.51 15-Aug-2003 tedu

change arguments to suser. suser now takes the process, and a flags
argument. old cred only calls use suser_ucred. this will allow future
work to more flexibly implement the idea of a root process. looks like
something i saw in freebsd, but a little different.
use of suser_ucred vs suser in file system code should be looked at again,
for the moment semantics remain unchanged.
review and input from art@ testing and further review miod@


# 1.50 06-Aug-2003 millert

Remove some double semicolons (hmm, do two semis equal a maxi?).
I've skipped the GNU stuff for now. From Patrick Latifi.


# 1.49 21-Jul-2003 tedu

enforce restrictions on prot and flags to mprotect and mmap. invalid or
undefined flags are now rejected instead of silently ignored. makes
"unintentional" mprotect calls a touch harder.
ok art@ deraadt@ jason@
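
A minimal sketch of the kind of check this entry describes, assuming only the standard PROT_* bits from <sys/mman.h> are valid. The function name is hypothetical and the actual kernel check may differ.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Reject any protection bits outside the defined PROT_* set instead of
 * silently ignoring them. Returns 0 if prot is acceptable, EINVAL if not.
 */
static int
sketch_check_prot(int prot)
{
	const int valid = PROT_READ | PROT_WRITE | PROT_EXEC;

	if ((prot & ~valid) != 0)
		return EINVAL;
	return 0;
}
```

The same mask-and-compare pattern applies to the flags argument, using whichever set of MAP_* bits the system defines.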


# 1.48 01-Jul-2003 tedu

add MAP_TRYFIXED, mostly to help emulate other systems.
when set, uvm will not attempt to avoid a heap address, if requested.
from todd vierling, via
http://marc.theaimsgroup.com/?l=netbsd-tech-kern&m=105612525808607&w=1


# 1.47 01-Jul-2003 tedu

remove sys_omquery. it was only used for two weeks, and you can't
source upgrade from a system that used it anyway.
ok art deraadt drahn


# 1.46 17-May-2003 grange

Typos; from Julien Bordet <zejames@greyhats.org>
Close PR 3262


Revision tags: UBC_SYNC_A
# 1.45 28-Apr-2003 drahn

Change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built, booted, and 'make includes' before building
ld.so with this change.


# 1.44 25-Apr-2003 drahn

backout mquery change, something broke when not combined with a different diff.


# 1.43 25-Apr-2003 drahn

change mquery() function call signature to be the same as mmap(). It
needs the prot/flags info and passing the addresses via arg/return allows
it to be traced via ktrace better than an in/out parameter.
This adds a new mquery syscall and renames the old one to omquery.
New kernel _MUST_ be built and installed before building ld.so with this change.
ok millert@ tedu@


# 1.42 18-Apr-2003 drahn

Return EINVAL if MAP_FIXED was specified but was not available. ok tedu@


# 1.41 17-Apr-2003 drahn

changes to support mquery with 1Gsep on i386. avoid heap on mappings.


# 1.40 14-Apr-2003 art

There are two related changes.

The first one is an mquery(2) syscall. It's for asking the VM system
about where to map things. It will be used by ld.so, read the man page
for details.

The second change is related and is a centralization of uvm_map hint
that all callers of uvm_map calculated. This will allow us to adjust
this hint on architectures that have segments for non-exec mappings.

deraadt@ drahn@ ok.


# 1.39 07-Apr-2003 mpech

int -> ssize_t.
+checked by regress.

millert@, art@ ok.


Revision tags: OPENBSD_3_3_BASE
# 1.38 09-Jan-2003 miod

Remove fetch(9) and store(9) functions from the kernel, and replace the few
remaining instances of them with appropriate copy(9) usage.

ok art@, tested on all arches unless my memory is non-ECC


# 1.37 08-Nov-2002 art

Don't uvm_useracc and then vslock. vslock is better at finding illegal mappings.


# 1.36 29-Oct-2002 art

Since memory deallocation can't fail, remove the error return from
uvm_unmap, uvm_deallocate and a few other functions.
Simplifies some code and reduces diff to the UBC branch.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.35 23-Aug-2002 pvalchev

Fix missing FRELE in mmap(2); ok art


Revision tags: OPENBSD_3_1_BASE
# 1.34 14-Feb-2002 art

Correctly FREF/FRELE in mmap(2).


# 1.33 19-Dec-2001 art

UBC was a disaster. It worked very well when it worked, but on some
machines or some configurations or in some phase of the moon (we actually
don't know when or why) files disappeared. Since we've not been able to
track down the problem in two weeks intense debugging and we need -current
to be stable, back out everything to a state it had before UBC.

We apologise for the inconvenience.


Revision tags: UBC_BASE
# 1.32 10-Dec-2001 art

branches: 1.32.2;
Merge in struct uvm_vnode into struct vnode.


# 1.31 04-Dec-2001 art

Yet another sync to NetBSD uvm.
Today we add a pmap argument to pmap_update() and allocate map entries for
kernel_map from kmem_map instead of using the static entries. This should
get rid of MAX_KMAPENT panics. Also some uvm_loan problems are fixed.


# 1.30 28-Nov-2001 art

Sync in more uvm from NetBSD. Mostly just cosmetic stuff.
Contains also support for page coloring.


# 1.29 28-Nov-2001 art

Sync in more uvm changes from NetBSD.
This time we're getting rid of KERN_* and VM_PAGER_* error codes and
use errnos instead.


# 1.28 27-Nov-2001 art

Merge in the unified buffer cache code as found in NetBSD 2001/03/10. The
code is written mostly by Chuck Silvers <chuq@chuq.com>/<chs@netbsd.org>.

Tested for the past few weeks by many developers, should be in a pretty stable
state, but will require optimizations and additional cleanups.


# 1.27 12-Nov-2001 art

Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.26 09-Nov-2001 art

minor sync to NetBSD.


# 1.25 07-Nov-2001 art

Another sync of uvm to NetBSD. Just minor fiddling, no major changes.


# 1.24 07-Nov-2001 art

Add an alignment argument to uvm_map that specifies an alignment hint
for the virtual address.


# 1.23 06-Nov-2001 art

Move the last content from vm/ to uvm/
The only thing left in vm/ are just dumb wrappers.
vm/vm.h includes uvm/uvm_extern.h
vm/pmap.h includes uvm/uvm_pmap.h
vm/vm_page.h includes uvm/uvm_page.h


# 1.22 05-Nov-2001 art

Minor sync to NetBSD.


# 1.21 26-Oct-2001 art

- every new fd created by falloc() is marked as larval and should not be used
by anyone. Every caller of falloc matures the fd when it's usable.
- Since every lookup in the fd table must now check this flag and all of
them do the same thing, move all the necessary checks into a function -
fd_getfile.


Revision tags: OPENBSD_3_0_BASE
# 1.20 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.19 11-Aug-2001 art

Various random fixes from NetBSD.
Including support for zeroing pages in the idle loop (not enabled yet).


# 1.18 06-Aug-2001 art

Add a new type voff_t (right now it's typedefed as off_t) used for offsets
into objects.

Gives the possibility to mmap beyond the size of vaddr_t.

From NetBSD.


# 1.17 25-Jul-2001 art

Some updates to UVM from NetBSD. Nothing really critical, just a sync.


# 1.16 25-Jul-2001 art

Change the pmap_enter interface to merge access_type and the wired boolean
and arbitrary flags into one argument.

One new flag is PMAP_CANFAIL that tells pmap_enter that it can fail if there
are not enough resources to satisfy the request. If this flag is not passed,
pmap_enter should panic as it should have done before this change (XXX - many
pmaps are still not doing that).

Only i386 and alpha implement CANFAIL for now.

Includes uvm updates from NetBSD.


# 1.15 23-Jun-2001 smart

Sync with NetBSD 19990911 (just before PMAP_NEW was required)
- thread_sleep_msg() -> uvm_sleep()
- initialize reference count lock in uvm_anon_{init,add}()
- add uao_flush()
- replace boolean 'islocked' with 'lockflags'
- in uvm_fault() change FALSE to TRUE in 'wide' fault handling
- get rid of uvm_km_get()
- various bug fixes


# 1.14 08-Jun-2001 art

Change the paddr_t pmap_extract(struct pmap *, vaddr_t) interface to
boolean_t pmap_extract(struct pmap *, vaddr_t, paddr_t *).
Matches NetBSD. Tested by various people on various platforms.


# 1.13 10-May-2001 art

More sync to NetBSD.
The highlight is some more advice values for madvise(2).
o MADV_DONTNEED will deactivate the pages in the given range giving a quicker
reuse.
o MADV_FREE will garbage-collect the pages and swap resources causing the
next fault to either page in new pages from backing store (mapped vnode)
or allocate new zero-fill pages (anonymous mapping).


# 1.12 10-May-2001 art

Some locking protocol fixes and better enforcement of wiring limits.

From NetBSD.


# 1.11 05-May-2001 art

Remove the (vaddr_t) casts inside the round_page and trunc_page macros.
We might want to use them on types that are bigger than vaddr_t.

Fix all callers that pass pointers without casts.


Revision tags: OPENBSD_2_9_BASE
# 1.10 22-Mar-2001 smart

Sync style, typo, and comments a little closer to NetBSD. art@ ok


# 1.9 09-Mar-2001 art

locking typo.


# 1.8 09-Mar-2001 art

Add mlockall and munlockall (dummy for the old vm system).


# 1.7 09-Mar-2001 art

More syncing to NetBSD.

Implements mincore(2), mlockall(2) and munlockall(2). mlockall and munlockall
are disabled for the moment.

The rest is mostly cosmetic.


# 1.6 29-Jan-2001 niklas

$OpenBSD$


Revision tags: OPENBSD_2_7_BASE OPENBSD_2_8_BASE
# 1.5 16-Mar-2000 art

Bring in some new UVM code from NetBSD (not current).

- Introduce a new type of map that is interrupt safe and never allows faults
in it. mb_map and kmem_map are made intrsafe.
- Add "access protection" to uvm_vslock (to be passed down to uvm_fault and
later to pmap_enter).
- madvise(2) now works.
- various cleanups.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.4 23-Aug-1999 art

branches: 1.4.4;
sync with NetBSD from 1999.05.24 (there is a reason for this date)
Mostly cleanups, but also a few improvements to pagedaemon for better
handling of low memory and/or low swap conditions.


# 1.3 04-Jun-1999 art

remove sys_omsync, it's already in compat. (how did this ever link?)


Revision tags: OPENBSD_2_5_BASE
# 1.2 26-Feb-1999 art

add OpenBSD tags


# 1.1 26-Feb-1999 art

Import of uvm from NetBSD. Some local changes, some code disabled