History log of /freebsd-10.1-release/sys/sys/proc.h
Revision Date Author Comments
# 272461 02-Oct-2014 gjb

Copy stable/10@r272459 to releng/10.1 as part of
the 10.1-RELEASE process.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 271372 10-Sep-2014 kib

MFC r271000:
Delay the return from thread_single(SINGLE_EXIT) until all threads are
really destroyed by thread_stash() after the last switch out.

MFC r271007:
Retire thread_unthread().

MFC r271008:
Style.

Approved by: re (marius)


# 270264 21-Aug-2014 kib

MFC r269656:
Implement and use proc_realparent(9).

MFC r270024 (by markj):
Correct the order of arguments passed to LIST_INSERT_AFTER().

For merge, the p_treeflag member of struct proc was moved to the end
of the structure, to keep KBI intact.


# 266582 23-May-2014 kib

MFC r266464:
In execve(2), postpone freeing the old vmspace until the threads are
resumed and exited.


# 263875 28-Mar-2014 kib

MFC r263475:
Fix two issues with /dev/mem access on amd64, both causing kernel page
faults.

First, accesses to the direct map region should check against the
limit up to which the direct map is instantiated.

Second, for accesses to the kernel map, use a new thread-private flag
TDP_DEVMEMIO, which instructs vm_fault() to return an error when a
fault happens on a MAP_ENTRY_NOFAULT entry, instead of panicking.

MFC r263498:
Add change forgotten in r263475. Make dmaplimit accessible outside
amd64/pmap.c.


# 260385 06-Jan-2014 scottl

MFC Alexander Motin's GEOM direct dispatch work:

r256603:
Introduce a new function, devstat_end_transaction_bio_bt(), adding a new
argument to specify the present time. Use this function to move binuptime()
out of the lock, substantially reducing lock congestion when a slow
timecounter is used.

r256606:
Move g_io_deliver() out of the lock, as required for direct dispatch.
Move g_destroy_bio() out too to reduce lock scope even more.

r256607:
Fix passing uninitialized bio_resid argument to g_trace().

r256610:
Add unmapped I/O support to GEOM RAID.

r256830:
Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodone() when unmapping
a temporarily mapped buffer. That fixes a double unmap if biodone() is
called twice for the same BIO (but with different done methods).

r256880:
Merge GEOM direct dispatch changes from the projects/camlock branch.

When safety requirements are met, it allows I/O requests to bypass
the GEOM g_up/g_down threads and execute directly in the caller's
context. That avoids CPU bottlenecks in the g_up/g_down threads, plus
saves several context switches per I/O.

r259247:
Fix a bug introduced in r256607. We have to recalculate bp_resid here since
the sizes of the original and completed requests may differ due to end of
media.

Testing of the stable/10 merge was done by Netflix, but all of the credit
goes to Alexander and iX Systems.

Submitted by: mav
Sponsored by: iX Systems


# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 255708 19-Sep-2013 jhb

Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatibility.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.

Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month
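
A hedged userland sketch of the new procctl(2) call with PROC_SPROTECT;
the flag names follow sys/procctl.h, and error handling is deliberately
thin:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/procctl.h>
    #include <err.h>
    #include <unistd.h>

    int
    main(void)
    {
            /* Protect this process; new children inherit the protection. */
            int flags = PPROT_SET | PPROT_INHERIT;

            if (procctl(P_PID, getpid(), PROC_SPROTECT, &flags) == -1)
                    err(1, "procctl(PROC_SPROTECT)");
            return (0);
    }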


# 254167 09-Aug-2013 cognet

Don't call sleepinit() from proc0_init(), make it a SYSINIT instead.
vmem needs the sleepq locks to be initialized when freeing KVA, so we want
it called as early as possible.


# 250601 13-May-2013 attilio

o Add accessor functions to add and remove pages from a specific
freelist.
o Split the pool of free page queues by domain for real, rather than
relying on the definition of VM_RAW_NFREELIST.
o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific
function that is called when calculating the allocation domain.
The RR counter is kept, currently, per-thread.
In the future this function is expected to evolve into a real
policy decision referee, based on specific information retrieved from
per-thread and per-vm_object attributes.
o Add the concept of "probed domains" under the form of vm_ndomains.
It is the responsibility of every architecture willing to support
multiple memory domains to correctly probe vm_ndomains along with the
mem_affinity segment attributes. Those two values are supposed to
always remain consistent.
Please also note that vm_ndomains and td_dom_rr_idx are both int
because segments already store domains as int. Ideally u_int would
make much more sense. Probably this should be cleaned up in the
future.
o Apply RR domain selection also to vm_phys_zero_pages_idle().

Sponsored by: EMC / Isilon storage division
Partly obtained from: jeff
Reviewed by: alc
Tested by: jeff


# 249488 14-Apr-2013 trociny

Similarly to proc_getargv() and proc_getenvv(), export proc_getauxv()
to be able to reuse the code.

MFC after: 3 weeks


# 249189 06-Apr-2013 glebius

Move CRITICAL_ASSERT() macro to systm.h, where the critical(9)
functions are declared.


# 247588 01-Mar-2013 jhb

Replace the TDP_NOSLEEPING flag with a counter so that the
THREAD_NO_SLEEPING() and THREAD_SLEEPING_OK() macros can nest.

Reviewed by: attilio


# 247454 28-Feb-2013 davide

MFcalloutng:
When the CPU becomes idle, cpu_idleclock() calculates the time to the next
timer event in order to reprogram the hw timer. Return that time in
sbintime_t to the caller and pass it to acpi_cpu_idle(), where it can be
used as one more factor (quite precise) to estimate further sleep time and
choose the optimal sleep state. This is a preparatory change for further
callout improvements that will be committed in the coming days.

The commit is not targeted for MFC.


# 246484 07-Feb-2013 kib

When a vforked child is traced, the debugging events are not generated
until the child performs exec(). The behaviour is reasonable when a
debugger is the real parent, because the parent is stopped until
exec(), and sending a debugging event to the debugger would deadlock
both parent and child.

On the other hand, when the debugger is not the parent of the vforked
child, not sending debugging signals makes it impossible to debug
across vfork.

Fix the issue by declining to generate debug signals only when vfork()
was done and the child called ptrace(PT_TRACEME). Set a new process
flag P_PPTRACE from the attach code for PT_TRACEME, if the P_PPWAIT
flag is set, which indicates that the process was created with vfork()
and has not yet execed. Check P_PPTRACE from issignal(), instead of
refusing the trace outright for the P_PPWAIT case. The scope of
P_PPTRACE is exactly contained in the scope of P_PPWAIT.

Found and tested by: zont
Reviewed by: pluknet
MFC after: 2 weeks


# 243142 16-Nov-2012 kib

In pget(9), if the PGET_NOTWEXIT flag is not specified, also search the
zombie list for the pid. This allows several kern.proc sysctls to
report useful information for zombies.

Hold the allproc_lock around all searches instead of relocking it.
Remove private pfind_locked() from the new nfs client code.

Requested and reviewed by: pjd
Tested by: pho
MFC after: 3 weeks


# 242958 13-Nov-2012 kib

Add the wait6(2) system call. It takes a POSIX waitid()-like process
designator to select the process to be waited for. The system call
optionally returns the siginfo_t that would otherwise be provided to
a SIGCHLD handler, as well as an extended structure accounting for
child and cumulative grandchild resource usage.

Allow getting the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for the WNOWAIT option for the whole
wait*(2) family, by not removing the queued signal state.

PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month
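
A hedged userland sketch of the call, using the WEXITED flag and the
extended rusage structure described above; the helper name reap() is
hypothetical:

    #include <sys/types.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <signal.h>

    void
    reap(pid_t child)
    {
            struct __wrusage wru;
            siginfo_t si;
            int status;

            /* WEXITED must be passed explicitly to wait for exited children. */
            if (wait6(P_PID, child, &status, WEXITED, &wru, &si) == -1)
                    return;
            /* si.si_status carries the exit code; wru has self/children usage. */
    }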


# 242139 26-Oct-2012 trasz

Add CPU percentage limit enforcement to RCTL. The resource name is "pcpu".
It was implemented by Rudolf Tomori during Google Summer of Code 2012.


# 241556 14-Oct-2012 kib

Add a KPI to allow reserving some amount of space in the numvnodes
counter, without actually allocating the vnodes. The supposed use of
getnewvnode_reserve(9) is to reclaim enough free vnodes while the
code does not yet hold any resources that might be needed during the
reclamation, and to consume the slack later for getnewvnode() calls
made from the innards. After the critical block is finished, the
caller shall free any remaining reserve with getnewvnode_drop_reserve(9).

Reviewed by: avg
Tested by: pho
MFC after: 1 week
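
A hedged kernel-side sketch of the intended pattern; the surrounding
context is hypothetical:

    /* Reclaim up to 4 vnodes now, before any critical locks are held. */
    getnewvnode_reserve(4);

    /* ... critical block: getnewvnode() calls consume the slack ... */

    /* Return whatever part of the reservation was not consumed. */
    getnewvnode_drop_reserve();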


# 239301 15-Aug-2012 kib

Add a sysctl kern.pid_max, which limits the maximum pid the system is
allowed to allocate, and a corresponding tunable with the same
name. Note that existing processes with higher pids are left intact.

MFC after: 1 week
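
For illustration, a small userland sketch that lowers the limit via
sysctlbyname(3); the new value 50000 is arbitrary:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
            int oldmax, newmax = 50000;
            size_t len = sizeof(oldmax);

            if (sysctlbyname("kern.pid_max", &oldmax, &len,
                &newmax, sizeof(newmax)) == 0)
                    printf("previous kern.pid_max: %d\n", oldmax);
            return (0);
    }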


# 237848 30-Jun-2012 kib

Remove stray blank line.

MFC after: 3 days


# 236321 30-May-2012 kib

vn_io_fault() is a facility to prevent page faults while filesystems
perform copyin/copyout of the file data into the usermode buffer. A
typical filesystem holds the vnode lock and some buffer locks over the
VOP_READ() and VOP_WRITE() operations, and since the page fault
handler may need to recurse into the VFS to get the page content, a
deadlock is possible.

The facility works by disabling page fault handling for the current
thread and attempting to execute the i/o while allowing uiomove() to
access the usermode mapping of the i/o buffer. If all buffer pages are
resident, uiomove() is successful and the request is finished. If
EFAULT is returned from uiomove(), the pages backing the i/o buffer
are faulted in and held, and the copyin/out is performed using
uiomove_fromphys() over the held pages for the second attempt of the
VOP call.

Since pages are held in chunks to prevent large i/o requests from
starving the free pages pool, and since the vnode lock is only taken
for i/o over the current chunk, the vnode lock no longer protects the
atomicity of the whole i/o request. Use the newly added rangelocks to
provide the required atomicity of i/o with regard to other i/o and
truncations.

Filesystems need to explicitly opt in to the scheme, by setting the
MNTK_NO_IOPF struct mount flag, and optionally by using the
vn_io_fault_uiomove(9) helper which takes care of calling uiomove() or
converting the uio into a request for uiomove_fromphys().

Reviewed by: bf (comments), mdf, pjd (previous version)
Tested by: pho
Tested by: flo, Gustau Pérez <gperez entel upc edu> (previous version)
MFC after: 2 months
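
A hedged sketch of the opt-in, with hypothetical myfs_* context; the
flag is set once at mount time and the helper replaces plain uiomove()
in the data-copy path:

    /* In myfs's VFS_MOUNT() implementation: */
    MNT_ILOCK(mp);
    mp->mnt_kern_flag |= MNTK_NO_IOPF;
    MNT_IUNLOCK(mp);

    /* In myfs's VOP_READ(), copying one block of file data: */
    error = vn_io_fault_uiomove(blkdata, xfersize, uio);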


# 236317 30-May-2012 kib

Add a rangelock implementation, intended to be used for range-locking
the i/o regions of the vnode data space. The implementation is quite
simple-minded: it uses a list of the lock requests, ordered by
arrival time. Each request may be for read or for write. The
implementation is fair FIFO.

MFC after: 2 months
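
A hedged sketch of the KPI as later used for vnode i/o ranges (the v_rl
field and VI_MTX macro are as in the vnode code; the interlock is
dropped while waiting):

    void *cookie;

    /* Write-lock the byte range [start, end) of the vnode's data. */
    cookie = rangelock_wlock(&vp->v_rl, start, end, VI_MTX(vp));
    /* ... perform the i/o ... */
    rangelock_unlock(&vp->v_rl, cookie, VI_MTX(vp));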


# 236117 26-May-2012 kib

Stop treating td_sigmask specially for the purposes of new thread
creation. Move it into the copied region of the struct thread.

Update some comments.

Requested by: bde
X-MFC after: never


# 235850 23-May-2012 kib

Calculate the count of per-process cow faults. Export the count to
userspace using the obscure spare int field in struct kinfo_proc.

Submitted by: Andrey Zonov <andrey zonov org>
MFC after: 1 week


# 234616 23-Apr-2012 kib

Allow the process information sysctls to accept a thread id in
addition to the process id. This follows the ptrace(2) interface and
allows debugging libraries to use thread ids directly, without a slow
and verbose conversion of thread id into pid.

The PGET_NOTID flag is provided to allow a specific sysctl to disallow
this behaviour. All current callers of pget(9) have useful semantics
when operating on a tid and do not need this flag.

Reviewed by: jhb, trociny
MFC after: 1 week
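
A hedged kernel sketch of the pget(9) pattern these sysctls use; the
flag choice is illustrative:

    struct proc *p;
    int error;

    /* pid may now also be a tid; p is returned locked on success. */
    error = pget(pid, PGET_CANDEBUG | PGET_NOTWEXIT, &p);
    if (error != 0)
            return (error);
    /* ... inspect p ... */
    PROC_UNLOCK(p);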


# 234172 12-Apr-2012 kib

Add a thread-private flag to indicate that the error value is already
placed in td_errno. The flag is supposed to be used by syscalls
returning EJUSTRETURN because errno was already placed into the
usermode frame by a call to set_syscall_retval(9). Both ktrace and
dtrace get the errno value from td_errno if the flag is set.

Use the flag to fix sigsuspend(2) error return ktrace records.

Requested by: bde
MFC after: 1 week


# 233291 22-Mar-2012 alc

Handle spurious page faults that may occur in no-fault sections of the
kernel.

When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB. In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB. This is exactly as
recommended in AMD's documentation. In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry. In short, this saves us from
having to perform potentially costly TLB flushes. In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry. Usually, such spurious page faults are handled
by vm_fault() without incident. However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault(). This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.

In collaboration with: kib
Tested by: gibbs (an earlier version)

I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.

MFC after: 1 week


# 232240 27-Feb-2012 kib

Currently, the debugger attached to the process executing vfork() does
not get the syscall exit notification until the child performs exec or
exit. Swap the order of doing ptracestop() and waiting for P_PPWAIT
clearing, by postponing the wait into syscallret() after the
ptracestop() notification is done.

Reported, tested and reviewed by: Dmitry Mikulin <dmitrym juniper net>
MFC after: 2 weeks


# 232048 23-Feb-2012 kib

Allow the parent to gather the exit status of the children reparented
to the debugger. When reparenting for debugging, keep the child on the
new orphan list of the old parent. When looping over the children in
kern_wait(), iterate over both the children list and the orphan list
to search for the process by pid.

Submitted by: Dmitry Mikulin <dmitrym juniper.net>
MFC after: 2 weeks


# 231320 09-Feb-2012 kib

Mark the automatically attached child with PL_FLAG_CHILD in struct
lwpinfo flags, for PT_FOLLOWFORK auto-attachment.

In collaboration with: Dmitry Mikulin <dmitrym juniper net>
MFC after: 1 week


# 231075 06-Feb-2012 kib

The current implementations of sync(2) and the syncer vnode fsync()
VOP use the mnt_noasync counter to temporarily remove the MNTK_ASYNC
mount option, which is needed to guarantee synchronous completion of
the initiated i/o before the syscall or VOP returns. Global removal of
the MNTK_ASYNC option is harmful because not only does i/o started
from the corresponding thread become synchronous, but so does all i/o
initiated on the filesystem during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for the current thread only. Use the
opportunity to move the DOINGASYNC() macro into sys/vnode.h and
consistently use it in the places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for
metadata-intensive operations on async-mounted UFS volumes, but still
with large deviation due to other factors.

Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks


# 230643 28-Jan-2012 attilio

Avoid checking the same cache line/variable from all the locking
primitives by breaking stop_scheduler into a per-thread variable.
Also, store the new td_stopsched very close to the td_*locks members,
as they will be accessed mostly in the same codepaths as td_stopsched;
this possibly avoids further cache-line pollution.

Giving STOP_SCHEDULER() a new 'thread' argument was pondered, in order
to take advantage of an already cached curthread, but in the end there
should not really be a performance benefit, while it would introduce a
KPI breakage.

In collaboration with: flo
Reviewed by: avg
MFC after: 3 months (or never)
X-MFC: r228424


# 230145 15-Jan-2012 trociny

Drop the nchr argument from proc_getargv() and proc_getenvv(): we always
want to read the strings completely to know the actual size.

As a side effect it fixes the issue with kern.proc.args and kern.proc.env
sysctls, which didn't return the size of available data when calling
sysctl(3) with the NULL argument for oldp.

Note, in get_ps_strings(), which does actual work for proc_getargv() and
proc_getenvv(), we still have a safety limit on the size of data read in
case of a corrupted process stack.

Suggested by: kib
MFC after: 3 days


# 228648 17-Dec-2011 trociny

Most of the sysctl_kern_proc functions start with the same pattern:
locate a process by calling pfind() and do some additional checks like
p_candebug(). To reduce this code duplication a new function, pget(),
is introduced and used.

As the function may be useful not only in kern_proc.c, it is placed in
the kernel name space.

Suggested by: kib
Reviewed by: kib
MFC after: 2 weeks


# 227833 22-Nov-2011 trociny

Add new sysctls, KERN_PROC_ENV and KERN_PROC_AUXV, to return
environment strings and ELF auxiliary vectors from a process stack.

Make sysctl_kern_proc_args read non-cached arguments from the
process stack.

Export proc_getargv() and proc_getenvv() so they can be reused by
procfs and linprocfs.

Suggested by: kib
Reviewed by: kib
Discussed with: kib, rwatson, jilles
Tested by: pho
MFC after: 2 weeks


# 227657 18-Nov-2011 kib

Consistently use the process spin lock to protect
p->p_boundary_count. The race could cause execve(2) from a threaded
process to hang, since the thread boundary counter was incorrect and
single-threading never finished.

Reported by: pluknet, pho
Tested by: pho
MFC after: 1 week


# 227392 09-Nov-2011 kib

Assert that _PRELE() is done for the held process.

Tested by: pho
MFC after: 1 week


# 225474 11-Sep-2011 kib

Inline the syscallenter() and syscallret(). This reduces the time measured
by the syscall entry speed microbenchmarks by ~10% on amd64.

Submitted by: jhb
Approved by: re (bz)
MFC after: 2 weeks


# 224987 18-Aug-2011 jonathan

Add experimental support for process descriptors

A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
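
A minimal userland sketch of the new calls; error handling is trimmed
for brevity:

    #include <sys/procdesc.h>
    #include <signal.h>
    #include <unistd.h>
    #include <err.h>

    int
    main(void)
    {
            pid_t pid;
            int pd;

            pid = pdfork(&pd, 0);
            if (pid == -1)
                    err(1, "pdfork");
            if (pid == 0) {         /* child */
                    pause();
                    _exit(0);
            }
            /* Parent: pd refers to the child; no SIGCHLD is delivered. */
            pdgetpid(pd, &pid);
            pdkill(pd, SIGTERM);
            close(pd);
            return (0);
    }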


# 223889 09-Jul-2011 kib

Add a facility to disable the processing of page faults. When
activated, uiomove generates EFAULT if any accessed address is not
mapped, as opposed to handling the fault.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc (previous version)


# 223888 09-Jul-2011 kib

Use 'curthread_pflags' instead of 'thread_pflags' to signify that only
curthread can be operated upon.

Requested by: attilio
MFC after: 1 week


# 223886 09-Jul-2011 kib

Implement helper functions to locally set a thread-private flag and
restore it to the previous state. Note that only setting a flag
locally is supported.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc (previous version)
MFC after: 1 week
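
A hedged sketch combining these helpers with the TDP_NOFAULTING
facility from the sibling commit above:

    int save;

    save = curthread_pflags_set(TDP_NOFAULTING);
    /* ... uiomove() here returns EFAULT instead of handling faults ... */
    curthread_pflags_restore(save);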


# 223088 14-Jun-2011 obrien

We should not return ECHILD when debugging a child and the parent does a
"wait4(-1, ..., WNOHANG, ...)". Instead wait(2) should behave as if the
child does not wish to report status at this time.

Reviewed by: jhb


# 222100 19-May-2011 jhb

Style fixes:
- Sort forward declarations of structures.
- Prefer uint64_t to u_int64_t.


# 220621 14-Apr-2011 pluknet

Remove the stale M_ZOMBIE malloc type.
This type has been unused since p_ru was embedded into struct proc.

MFC after: 1 week


# 220137 29-Mar-2011 trasz

Add racct. It's an API to keep per-process, per-jail and per-loginclass
resource accounting information, to be used by the new resource limits
code. It's connected to the build, but the code that actually calls the
new functions will come later.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 219906 23-Mar-2011 jhb

Update a comment. The kernel stopped using the S* process state constants
a long time ago.


# 218609 12-Feb-2011 dchagin

Remove the thr_exit1() declaration, unused since r134586.


# 218424 07-Feb-2011 mdf

Based on discussions on the svn-src mailing list, rework r218195:

- entirely eliminate some calls to uio_yield() as being unnecessary,
such as in a sysctl handler.

- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h

- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield() (see the sketch after this list).

- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.

- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.
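
A hedged sketch of the reworked interface in a long-running kernel
loop; work_remains() and do_one_unit() are hypothetical names:

    while (work_remains()) {
            do_one_unit();
            /* Yield voluntarily once we have run long enough. */
            if (should_yield())
                    kern_yield(PRI_UNCHANGED);
    }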


# 217819 25-Jan-2011 kib

Allow the debugger to specify that children of the traced process should
be automatically traced. Extend ptrace(PT_LWPINFO) to report that the
child has just forked.

Reviewed by: davidxu, jhb
MFC after: 2 weeks


# 216313 09-Dec-2010 davidxu

MFp4:
It is possible for a lower priority thread to lend priority to a higher
priority thread; in the old code this was ignored, but the lending should
always be recorded. Add the field td_lend_user_pri to fix the problem;
if a thread does not have borrowed priority, its value is PRI_MAX.

MFC after: 1 week


# 213950 17-Oct-2010 davidxu

- Insert thread0 into the correct thread hash link list.
- In thr_exit() and kthread_exit(), only remove the thread from the
hash if it can directly exit, otherwise let exit1() do it.
- In thread_suspend_check(), fix the cleanup code when the thread needs
to exit.
This change seems to fix the "Bad link elm" panic found by
Peter Holm.

Stress testing: pho


# 213714 11-Oct-2010 davidxu

Add a flag TDF_TIDHASH to prevent a thread from being
added to or removed from the thread hash table multiple times.


# 213642 09-Oct-2010 davidxu

Create a global thread hash table to speed up thread lookup, using an
rwlock to protect the table. In the old code, thread lookup is done
with the process lock held: to find a thread, the kernel has to
iterate through the process and thread lists, which is quite
inefficient. With this change, tests show that in extreme cases
performance is dramatically improved.

Earlier patch was reviewed by: jhb, julian
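
A hedged sketch of a lookup through the new hash, using tdfind() as it
appears in sys/proc.h (it returns the thread with its process locked):

    struct thread *td;

    td = tdfind(tid, -1);           /* -1: any pid */
    if (td == NULL)
            return (ESRCH);
    /* ... td->td_proc is locked here ... */
    PROC_UNLOCK(td->td_proc);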


# 212999 22-Sep-2010 jhb

Copy td_rqindex during fork instead of zeroing it, to match the comments.
I do not believe there is any functional change.


# 212824 18-Sep-2010 kib

Adopt the deferring of object deallocation for the deleted map entries
on map unlock to the lock downgrade and later read unlock operation.

System map entries cannot be backed by OBJT_VNODE objects, no need to
defer deallocation for them. Map entries from user maps do not require
the owner map for deallocation, and can be accumulated in the
thread-local list for freeing when a user map is unlocked.

Move the collection of entries for deferred reclamation into
vm_map_delete(). Create helper vm_map_process_deferred(), that is
called from locations where processing is feasible. Do not process
deferred entries in vm_map_unlock_and_wait() since map_sleep_mtx is
held.

Reviewed by: alc, rstone (previous versions)
Tested by: pho
MFC after: 2 weeks


# 210226 18-Jul-2010 trasz

Revert r210225 - turns out I was wrong; the "/*-" is not a license-only
thing; it's also used to indicate that the comment should not be
automatically rewrapped.

Explained by: cperciva@


# 210225 18-Jul-2010 trasz

The "/*-" comment marker is supposed to denote copyrights. Remove non-copyright
occurences from sys/sys/ and sys/kern/.


# 210138 15-Jul-2010 jhb

Retire td_syscalls now that it is no longer needed.


# 209688 04-Jul-2010 kib

Extend ptrace(PT_LWPINFO) to report the siginfo for the signal that caused
the debuggee to stop. The change should keep the ABI. Take care of compat32.

Discussed with: davidxu, jhb
MFC after: 2 weeks
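
A userland sketch of reading that siginfo after a wait(2) reports a
stop; PL_FLAG_SI and pl_siginfo are the fields this change adds:

    #include <sys/types.h>
    #include <sys/ptrace.h>
    #include <signal.h>

    void
    show_stop(pid_t pid)
    {
            struct ptrace_lwpinfo pl;

            if (ptrace(PT_LWPINFO, pid, (caddr_t)&pl, sizeof(pl)) == -1)
                    return;
            if (pl.pl_flags & PL_FLAG_SI) {
                    /* pl.pl_siginfo describes the stopping signal. */
            }
    }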


# 209204 15-Jun-2010 kib

Rename CRITSECT_ASSERT to CRITICAL_ASSERT.

Suggested by: jhb
MFC after: 1 month


# 209197 15-Jun-2010 kib

Add assert to check that the (current) thread is in critical section.

MFC after: 1 month


# 208988 10-Jun-2010 mav

Store the interrupt trap frame in struct thread. It allows an interrupt
handler to obtain both the trap frame and the opaque argument submitted
at registration. After the kernel and all drivers get used to it, the
legacy hack can be removed.

Reviewed by: jhb@


# 208453 23-May-2010 kib

Reorganize syscall entry and leave handling.

Extend struct sysvec with three new elements:
sv_fetch_syscall_args - the method to fetch syscall arguments from
usermode into struct syscall_args. The structure is machine-dependent
(this might be reconsidered after all architectures are converted).
sv_set_syscall_retval - the method to set a return value for usermode
from the syscall. It is a generalization of
cpu_set_syscall_retval(9) to allow ABIs to override the way to set a
return value.
sv_syscallnames - the table of syscall names.

Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding
the call to cpu_set_syscall_retval().

The new functions syscallenter(9) and syscallret(9) are provided that
use sv_*syscall* pointers and contain the common repeated code from
the syscall() implementations for the architecture-specific syscall
trap handlers.

Syscallenter() fetches the arguments, calls the syscall implementation
from the ABI sysent table, and sets up the return frame. The end of
syscall bookkeeping is done by syscallret().

Take advantage of single place for MI syscall handling code and
implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and
PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the
thread is stopped at syscall entry or return point respectively. The
EXEC flag augments SCX and notifies debugger that the process address
space was changed by one of exec(2)-family syscalls.

The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are
changed to use syscallenter()/syscallret(). MIPS and arm are not
converted and use the mostly unchanged syscall() implementation.

Reviewed by: jhb, marcel, marius, nwhitehorn, stas
Tested by: marcel (ia64), marius (sparc64), nwhitehorn (powerpc),
stas (mips)
MFC after: 1 month
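
A heavily hedged sketch of the shape an MD trap handler takes after
this change (signatures as in that era's subr_syscall code; details
vary per architecture):

    void
    syscall(struct trapframe *frame)
    {
            struct thread *td = curthread;
            struct syscall_args sa;
            int error;

            td->td_frame = frame;
            error = syscallenter(td, &sa);  /* fetch args, run syscall */
            /* ... MD signal and trace bookkeeping ... */
            syscallret(td, error, &sa);     /* ktrace/ptrace, retval */
    }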


# 207602 04-May-2010 kib

Implement RUSAGE_THREAD. Add td_rux to keep extended runtime and ticks
information for the thread, to allow calcru1() (re)use.

Rename ruxagg()->ruxagg_locked(), ruxagg_tlock()->ruxagg() [1].
The ruxagg_locked() function no longer clears thread ticks nor
td_incruntime.

Requested by: attilio [1]
Discussed with: attilio, bde
Reviewed by: bde
Based on submission by: Alexander Krizhanovsky <ak natsys-lab com>
MFC after: 1 week
X-MFC-Note: td_rux shall be moved to the end of struct thread


# 207600 04-May-2010 kib

Move definition of struct rusage_ext before struct thread.

MFC after: 1 week


# 206264 06-Apr-2010 kib

When the OOM killer searches for a process to kill, ignore processes
already killed by OOM. When a killed process waits for a page
allocation, try to satisfy the request as fast as possible.

This removes the often encountered deadlock, where OOM continuously
selects the same victim process, which sleeps uninterruptibly waiting
for a page. The killed process may still sleep if the page cannot be
obtained immediately, but testing has shown that the system has a much
higher chance to survive an OOM situation with the patch.

In collaboration with: pho
Reviewed by: alc
MFC after: 4 weeks


# 202882 23-Jan-2010 kib

For the PT_TO_SCE stop that stops the ptraced process upon syscall
entry, the syscall arguments are collected before ptracestop() is
called. As a consequence, the debugger cannot modify the syscall or
its arguments.

For the i386, amd64 and ia32-on-amd64 MD syscall(), reread the syscall
number and arguments after ptracestop(), if the debugger modified
anything in the process environment. Since the procfs stop event
requires the number of syscall arguments in p_xstat, this cannot be
solved by moving the stop/trace point before argument fetching.

Move the code that reads the arguments into a separate function,
fetch_syscall_args(), to avoid code duplication. Note that the ktrace
point for a modified syscall is intentionally recorded twice, once
with the original arguments, and a second time with the arguments set
by the debugger.

The PT_TO_SCX stop is executed after cpu_set_syscall_retval() has
already run.

Reported by: Ali Polatel <alip exherbo org>
Briefly discussed with: jhb
MFC after: 3 weeks


# 201879 08-Jan-2010 attilio

Introduce the new kernel thread called "deadlock resolver".
While the name is pretentious, a good explanation of its targets is
reported in this 17-month-old presentation e-mail:
http://lists.freebsd.org/pipermail/freebsd-arch/2008-August/008452.html

In order to implement it, the sq_type in sleepqueues is mandatory and
no longer compiled only along with the INVARIANTS option. Additionally,
a new sleepqueue function, sleepq_type(), is added, returning the type
of the sleepqueue linked to a wchan.
Three new sysctls are added in order to configure the thread:
debug.deadlkres.slptime_threshold
debug.deadlkres.blktime_threshold
debug.deadlkres.sleepfreq

representing the thresholds for sleep and block time that will lead to
a deadlock match (when exceeded), while sleepfreq represents the
number of seconds between two consecutive runs of the thread.
In order to enable the deadlock resolver thread, recompile your kernel
with the option DEADLKRES.

Reviewed by: jeff
Tested by: pho, Giovanni Trematerra
Sponsored by: Nokia Incorporated, Sandvine Incorporated
MFC after: 2 weeks


# 201790 08-Jan-2010 attilio

- Fix a bug in sched_4bsd where the timestamp for the sleeping
operation is not cleaned up on wakeup but reset.
This is harmless mostly because td_slptick (and ki_slptime from
userland) should be analyzed only with the assumption that the thread
is actually sleeping (thus while td_slptick is correctly set), but
without this invariant the number is no longer consistent.
- Move td_slptick from u_int to int in order to follow 'ticks'
signedness and wrap up accordingly [0]

[0] Submitted by: emaste
Sponsored by: Sandvine Incorporated
MFC after: 1 week


# 200732 19-Dec-2009 ed

Let access overriding to TTYs depend on the cdev_priv, not the vnode.

Basically this commit changes two things, which improves access to TTYs
in exceptional conditions. The problem was that when you ran
jexec(8) to attach to a jail, you couldn't use /dev/tty (nor the
node of the actual TTY, e.g. /dev/pts/X). This is very inconvenient if
you want to attach to screens quickly, use ssh(1), etc.

The fixes:

- Cache the cdev_priv of the controlling TTY in struct session. Change
devfs_access() to compare against the cdev_priv instead of the vnode.
This allows you to bypass UNIX permissions, even across different
mounts of devfs.

- Extend devfs_prison_check() to unconditionally expose the device node
of the controlling TTY, even if normal prison nesting rules normally
don't allow this. This actually allows you to interact with this
device node.

To be honest, I'm not really happy with this solution. We now have to
store three pointers to a controlling TTY (s_ttyp, s_ttyvp, s_ttydp).
In an ideal world, we should just get rid of the latter two and only
use s_ttyp, but this makes certain pieces of code very impractical
(e.g. devfs, kern_exit.c).

Reported by: Many people


# 199135 10-Nov-2009 kib

Extract the code that records syscall results in the frame into MD
function cpu_set_syscall_retval().

Suggested by: marcel
Reviewed by: marcel, davidxu
PowerPC, ARM, ia64 changes: marcel
Sparc64 tested and reviewed by: marius, also sun4v reviewed
MIPS tested by: gonzo
MFC after: 1 month


# 198854 03-Nov-2009 attilio

Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).
This improvement aims to avoid further cache misses in scheduler
specific functions which need to keep track of average thread running
time, and further locking in places setting this flag.

Reported by: jeff (originally), kris (currently)
Reviewed by: jhb
Tested by: Giuseppe Cocomazzi <sbudella at email dot it>


# 196730 01-Sep-2009 kib

Reintroduce r196640, after fixing the problem with my testing.

Remove the altkstacks; instead, instantiate threads with the kernel
stack allocated with the right size from the start. For a thread that
has its kernel stack cached, verify that the requested stack size is
equal to the actual one, and reallocate the stack if the sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for the
non-first threads in a process. Introduce a separate kernel stack
cache that keeps some limited amount of preallocated kernel stacks to
lower the latency of thread allocation. Add a vm_lowmem handler to
prune the cache on low memory conditions. This way, a system with a
reasonable number of threads gets lower thread creation latency, while
still not exhausting a significant portion of KVA for unused kstacks.

Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho (and retested according to new test scenarios)
MFC after: 1 week


# 196648 29-Aug-2009 kib

Revert r196640 and r196644 for now.


# 196640 29-Aug-2009 kib

Remove the altkstacks; instead, instantiate threads with the kernel
stack allocated with the right size from the start. For a thread that
has its kernel stack cached, verify that the requested stack size is
equal to the actual one, and reallocate the stack if the sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for the
non-first threads in a process. Introduce a separate kernel stack
cache that keeps some limited amount of preallocated kernel stacks to
lower the latency of thread allocation. Add a vm_lowmem handler to
prune the cache on low memory conditions. This way, a system with a
reasonable number of threads gets lower thread creation latency, while
still not exhausting a significant portion of KVA for unused kstacks.

Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho
MFC after: 1 week


# 195702 14-Jul-2009 kib

Add a new msleep(9) flag PBDY that shall be specified together with
PCATCH, to indicate that the thread shall not be stopped upon receipt
of SIGSTOP until it reaches the kernel->usermode boundary.

Also change thread_single(SINGLE_NO_EXIT) to only stop threads at
the user boundary unconditionally.

Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
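
A hedged sketch of a sleep using the new flag; the channel, mutex and
wmesg are hypothetical:

    /* Sleep interruptibly, but defer SIGSTOP to the user boundary. */
    error = msleep(&sc->state, &sc->mtx, PPAUSE | PCATCH | PBDY,
        "scwait", 0);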


# 194013 11-Jun-2009 zec

In struct thread, fields td_vnet and td_vnet_lpush may only be accessed
via curthread (via appropriate macros), so update the comment indicating
the protection model for those two fields.


# 194012 11-Jun-2009 zec

Introduce a mechanism for detecting calls from outbound path of the
network stack when reentering the inbound path from netgraph, and
force queueing of mbufs at the outbound netgraph node.

The mechanism relies on two components. First, in netgraph nodes
where outbound path of the network stack calls into netgraph, the
current thread has to be appropriately marked using the new
NG_OUTBOUND_THREAD_REF() macro before proceeding to call further
into the netgraph topology, and unmarked using the
NG_OUTBOUND_THREAD_UNREF() macro before returning to the caller.
Second, netgraph nodes which can potentially reenter the network
stack in the inbound path have to mark their inbound hooks using
NG_HOOK_SET_TO_INBOUND() macro. The netgraph framework will then
detect when there is a danger of a call graph looping back from
outbound to inbound path via netgraph, and defer handing off the
mbufs to the "inbound" node to a worker thread with a clean stack.

In this first pass only the most obvious netgraph nodes have been
updated to ensure no outbound to inbound calls can occur. Nodes
such as ng_ipfw, ng_gif etc. should be further examined whether a
potential for outbound to inbound call looping exists.

This commit changes the layout of struct thread, but due to
__FreeBSD_version number shortage a version bump has been omitted
at this time; nevertheless, the kernel and modules have to be rebuilt.

Reviewed by: julian, rwatson, bz
Approved by: julian (mentor)


# 192462 20-May-2009 jhb

Add a new locking note for p_aioinfo as it is not a normal PROC_LOCK field.


# 191917 08-May-2009 zec

A NOP change: style / whitespace cleanup of the noise that slipped
into r191816.

Spotted by: bz
Approved by: julian (mentor) (an earlier version of the diff)


# 191816 05-May-2009 zec

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The curvnet context
should be set on entry to networking code via the CURVNET_SET() macros,
and reverted to the previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, with an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking-only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typical cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)
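
A hedged sketch of the discipline in an inbound path, deriving the
context from the receiving ifnet:

    CURVNET_SET(ifp->if_vnet);
    /* ... networking work that may consult curvnet ... */
    CURVNET_RESTORE();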


# 189878 16-Mar-2009 kib

Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.

First, the ffs background cg group block write requests a new buffer
for the shadow copy. When ffs_bufwrite() is called from the bufdaemon
due to a buffer shortage, requesting the buffer deadlocks the bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request that
getblk not block while allocating the buffer, and return failure
instead. Add a flag argument to geteblk() to allow passing the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk() if
buffer allocation failed and either GB_NOWAIT_BD is specified, or
geteblk() is called from the bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to a synchronous cg block write if the
shadow block allocation failed.

Since r107847, a buffer write assumes that the vnode owning the
buffer is locked. The second problem is that the buffer cache may
accumulate many buffers belonging to a limited number of vnodes. With
such a workload, quite often threads that own the mentioned vnode
locks are trying to read another block from the vnodes, and, due to
buffer cache exhaustion, are asking bufdaemon for help. Bufdaemon is
unable to make any substantial progress because the vnodes are locked.

Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf(). The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by the new sysctl vfs.flushbufqtarget. Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and the
threads that temporarily help it with the TDP_BUFNEED flag.

In collaboration with: pho
Reviewed by: tegge (previous version)
Tested by: glebius, yandex ...
MFC after: 3 weeks


# 189573 09-Mar-2009 rwatson

Use a u_int for p_lock instead of a char: this avoids a (somewhat
unlikely but not impossible given modern thread counts) wrap-around,
and the compiler was padding it out to an int (at least) anyway.

MFC after: 3 days (but confirm ABI impact)


# 189571 09-Mar-2009 rwatson

Remove two now-defunct KSE fields from struct thread: td_uuticks and
td_usticks.


# 189570 09-Mar-2009 rwatson

Add a new thread-private flag, TDP_AUDITREC, to indicate whether or
not there is an audit record hung off of td_ar on the current thread.
Test this flag instead of td_ar when auditing syscall arguments or
checking for an audit record to commit on syscall return. Under
these circumstances, td_pflags is much more likely to be in the cache
(especially if there is no auditing of the current system call), so
this should help reduce cache misses in the system call return path.

MFC after: 1 week
Reported by: kris
Obtained from: TrustedBSD Project


# 185647 05-Dec-2008 kib

Several threads in a process may do vfork() simultaneously. Then, all
parent threads sleep on the parent's struct proc until the
corresponding child releases the vmspace. Each sleep is interlocked
with the proc mutex of the child, which triggers an assertion in
sleepq_add(). The assertion requires that at any time, all
simultaneous sleepers for the channel use the same interlock.

Silence the assertion by using a condition variable allocated in the
child. Broadcast the variable event on exec() and exit().

Since struct proc * sleep wait channel is overloaded for several
unrelated events, I was unable to remove wakeups from the places where
cv_broadcast() is added, except exec().

Reported and tested by: ganbold
Suggested and reviewed by: jhb
MFC after: 2 weeks


# 185029 17-Nov-2008 pjd

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This brings a huge amount of changes; I'll enumerate only the user-visible ones:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows using additional disks for cache.
Huge performance improvements, mostly for random reads of mostly
static content.

- slog

Allows using additional disks for the ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by them. Be very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Previously, if a write request failed, the system panicked. Now one
can select one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just like the quota and reservation properties, but they don't count
space consumed by child file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


# 184667 05-Nov-2008 davidxu

Revert rev 184216 and 184199; due to the way the thread_lock works,
it may cause a lockup.

Noticed by: peter, jhb


# 184199 23-Oct-2008 davidxu

Actually, for signal delivery and thread suspension, the extra process
spin lock is unnecessary; the normal process lock and thread lock are
enough. The spin lock is still needed for process and thread exit, to
mimic the single sched_lock.


# 183911 15-Oct-2008 davidxu

Move the per-thread userland debugging flags into a separate field.
This eliminates some locking problems, e.g. when a thread lock is
needed but can not be used at that time. Only the process lock is
needed now for the new field.


# 183073 16-Sep-2008 kib

When an attempt is made to suspend a filesystem that is already
suspended, wait until the current suspension is lifted instead of
silently returning success immediately. The consequences of calling
vfs_write_resume() when not owning the suspension are not well-defined
at best.

Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.

Add the thread flag TDP_IGNSUSP that allows bypassing the suspension
point in vn_start_write(). It is intended for use by the VFS in
situations where the suspender wants to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


# 182011 22-Aug-2008 jhb

A suspended thread can, in fact, be swapped out. Thus,
thread_unsuspend_one() needs to optionally wakeup the swapper. Since we
hold the thread lock for that entire function, however, we have to push
that requirement up into the caller.

Found by: rwatson


# 181905 20-Aug-2008 ed

Integrate the new MPSAFE TTY layer to the FreeBSD operating system.

The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:

- Improved driver model:

The old TTY layer has a driver model that is not abstract enough to
make it friendly to use. A good example is the output path, where the
device drivers directly access the output buffers. This means that an
in-kernel PPP implementation must always convert network buffers into
TTY buffers.

If a PPP implementation would be built on top of the new TTY layer
(still needs a hooks layer, though), it would allow the PPP
implementation to directly hand the data to the TTY driver.

- Improved hotplugging:

With the old TTY layer, it isn't entirely safe to destroy TTY's from
the system. This implementation has a two-step destructing design,
where the driver first abandons the TTY. After all threads have left
the TTY, the TTY layer calls a routine in the driver, which can be
used to free resources (unit numbers, etc).

The pts(4) driver also implements this feature, which means
posix_openpt() will now return PTY's that are created on the fly.

- Improved performance:

One of the major improvements is the per-TTY mutex, which is expected
to improve scalability when compared to the old Giant locking.
Another change is the unbuffered copying to userspace, which is both
used on TTY device nodes and PTY masters.

Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.

Obtained from: //depot/projects/mpsafetty/...
Approved by: philip (ex-mentor)
Discussed: on the lists, at BSDCan, at the DevSummit
Sponsored by: Snow B.V., the Netherlands
dcons(4) fixed by: kan


# 181334 05-Aug-2008 jhb

If a thread that is swapped out is made runnable, then the setrunnable()
routine wakes up proc0 so that proc0 can swap the thread back in.
Historically, this has been done by waking up proc0 directly from
setrunnable() itself via a wakeup(). When waking up a sleeping thread
that was swapped out (the usual case when waking proc0 since only sleeping
threads are eligible to be swapped out), this resulted in a bit of
recursion (e.g. wakeup() -> setrunnable() -> wakeup()).

With sleep queues having separate locks in 6.x and later, this caused a
spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock).
An attempt was made to fix this in 7.0 by making the proc0 wakeup use
the ithread mechanism for doing the wakeup. However, this required
grabbing proc0's thread lock to perform the wakeup. If proc0 was asleep
elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated
into the same LOR since the thread lock would be some other sleepq lock.

Fix this by deferring the wakeup of the swapper until after the sleepq
lock held by the upper layer has been locked. The setrunnable() routine
now returns a boolean value to indicate whether or not proc0 needs to be
woken up. The end result is that consumers of the sleepq API such as
*sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup
proc0 if they get a non-zero return value from sleepq_abort(),
sleepq_broadcast(), or sleepq_signal().

Discussed with: jeff
Glanced at by: sam
Tested by: Jurgen Weber jurgen - ish com au
MFC after: 2 weeks


# 180799 25-Jul-2008 kib

Call pargs_drop() unconditionally in do_execve(); the function correctly
handles the NULL argument.
Make pargs_free() static.

MFC after: 1 week


# 179175 21-May-2008 kib

Implement the per-open file data for the cdev.

The patch does not change the cdevsw KBI. Management of the data is
provided by the functions
int devfs_set_cdevpriv(void *priv, cdevpriv_dtr_t dtr);
int devfs_get_cdevpriv(void **datap);
void devfs_clear_cdevpriv(void);
All of the functions are supposed to be called from the cdevsw method
contexts.

- devfs_set_cdevpriv assigns the priv as private data for the file
descriptor which is used to initiate the currently performed driver
operation. dtr is the function that will be called when either the
last reference to the file goes away, the device is destroyed, or
devfs_clear_cdevpriv is called.
- devfs_get_cdevpriv is the obvious accessor.
- devfs_clear_cdevpriv allows clearing the private data for the
still-open file.

The implementation keeps the driver-supplied pointers in the struct
cdev_privdata, which is referenced both from the struct file and the
struct cdev, and cannot outlive either of them.

Man pages will be provided after the KPI stabilizes.

Reviewed by: jhb
Useful suggestions from: jeff, antoine
Debugging help and tested by: pho
MFC after: 1 month
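
A hedged driver-side sketch of the new KPI, with hypothetical mydev_*
names (cdevsw method signatures as in this era's tree):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/conf.h>
    #include <sys/malloc.h>
    #include <sys/uio.h>

    struct mydev_state {
            int     unit;           /* hypothetical per-open state */
    };

    static void
    mydev_dtr(void *priv)
    {
            /* Called when the last reference to the open file goes away. */
            free(priv, M_DEVBUF);
    }

    static int
    mydev_open(struct cdev *dev, int oflags, int devtype, struct thread *td)
    {
            struct mydev_state *priv;

            priv = malloc(sizeof(*priv), M_DEVBUF, M_WAITOK | M_ZERO);
            return (devfs_set_cdevpriv(priv, mydev_dtr));
    }

    static int
    mydev_read(struct cdev *dev, struct uio *uio, int ioflag)
    {
            void *priv;
            int error;

            error = devfs_get_cdevpriv(&priv);
            if (error != 0)
                    return (error);
            /* ... per-open state lives in priv ... */
            return (0);
    }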


# 179092 18-May-2008 jb

Add the hooks for the extra data that DTrace allocates for struct thread
and struct proc.

Add a field to struct thread to stash the error variable (or returned
status) from the last syscall so that it is available during a
DTrace probe.


# 178888 09-May-2008 julian

Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on, is "policy based routing", which allows
different packet streams to be routed by more than just the
destination address.

Constraints:
------------

I want to make some form of this available in the 6.x tree
(and by extension 7.x), but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extend that array to be a 2-dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] where for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).

You CAN delete an interface route from one FIB should you want
to, but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us to how the correct FIB is selected for an outgoing
IPv4 packet.

Firstly, all packets have a FIB associated with them. If nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ Locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility called setfib
that acts a bit like nice:

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail,
but I have not done so. It can be achieved by combining the setfib
and jail commands.
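For example (an illustrative command line, using the old jail(8)
argument order):

    setfib 2 jail / jailhost 10.1.1.1 /bin/sh  # the jail runs with FIB 2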

2/ Packets received on an interface for forwarding.
By default these packets would use table 0
(or possibly a number settable via a sysctl; not yet implemented),
but prior to routing, the firewall can inspect them (see below).
(Possibly in the future you may be able to associate a FIB
with packets received on an interface via an ifconfig argument, but not yet.)

3/ Packets inspected by a packet classifier, which can arbitrarily
associate a FIB with them on a packet-by-packet basis.
A FIB assigned to a packet by a packet classifier
(such as ipfw) would override a FIB assigned by
a more default source (such as cases 1 or 2).

4/ A TCP listen socket associated with a FIB will generate
accept sockets that are associated with that same FIB.

5/ Packets generated in response to some other packet (e.g. reset
or ICMP packets). These should use the FIB associated with the
packet being responded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect for the process that set up the tunnel.
Thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the FIB for the tunnel to use to FIB 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
Messages from the kernel would be associated with the FIB they
refer to and would only be received by a routing socket associated
with that FIB. (Not yet implemented.)

In addition, netstat has been edited to be able to cope with the
fact that the array is now two dimensional. (It looks in system
memory using libkvm!) Old versions of netstat see only the first FIB.

In addition, two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:
-------------------------

Basically our (IronPort's) appliance provides this functionality already
using ipfw fwd, but that method has some drawbacks.

For example, it can't fully simulate a routing table because it can't
influence the socket's choice of local address when a connect() is done.

Testing during the generation of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routed
accordingly.

ipfw has grown 2 new keywords:

setfib N ip from any to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names
to the FIBs, but I do not have that capability. I am not sure if it is
required.

SCTP has, interestingly enough, built-in support for this, called VRFs
in Cisco parlance. It will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:
--------------------

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. This will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format and pointed to by the
one dimensional array, is a bit silly, especially when one considers
that there is code that makes assumptions about every protocol having
the same internal structures there. Some protocols don't WANT that
sort of structure (for example, the whole idea of a netmask is foreign
to AppleTalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing to the data.
Instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods reached this way would be called. The methods would have
an argument that gives the FIB number, but the protocol would be free
to ignore it.
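
Sketched in C, the suggested shape might look something like this
(purely hypothetical; the structure and field names are invented for
illustration):

    struct domain_rt_ops {
            void *(*dom_rt_lookup)(const struct sockaddr *dst, u_int fibnum);
            int   (*dom_rt_request)(int req, const struct sockaddr *dst,
                u_int fibnum);
    };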

When the ABI can be changed, it raises the possibility of the
addition of a FIB number to "struct route". Currently,
the structure contains the sockaddr of the destination and the resulting
FIB entry. To make this work fully, one could add a FIB number
so that given an address and a FIB, one can find the third element, the
FIB entry.
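
A hypothetical (ABI-breaking) layout for that, purely as a sketch:

    struct route {
            struct rtentry  *ro_rt;     /* the resulting fib entry */
            u_int            ro_fibnum; /* new: the FIB that resolved it */
            struct sockaddr  ro_dst;    /* the destination address */
    };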

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco


# 178471 25-Apr-2008 jeff

- Add an integer argument to idle to indicate how likely we are to wake
from idle over the next tick.
- Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are
suspended in cpu specific states. This function can fail and cause the
scheduler to fall back to another mechanism (ipi).
- Implement support for mwait in cpu_idle() on i386/amd64 machines that
support it. mwait is a higher performance way to synchronize cpus
as compared to hlt & ipis.
- Allow selecting the idle routine by name via sysctl machdep.idle. This
replaces machdep.cpu_idle_hlt. Only idle routines supported by the
current machine are permitted.

Sponsored by: Nokia


# 178272 17-Apr-2008 jeff

- Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
- In reset walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.

Sponsored by: Nokia


# 177957 06-Apr-2008 attilio

Optimize lockmgr in order to get rid of the pool mutex interlock, of the
state transitioning flags and of msleep(9) callings.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
already do, and direct accesses to the sleepqueue(9) primitive.

In order to avoid writer starvation, a mechanism very similar to what
rwlock(9) now uses is implemented, with the corresponding per-thread
shared lockmgr counter.

This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw(). These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock. In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.

The patch drops the support for WITNESS atm, but it will probably be
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wake up, are in the runqueue, they can contend the lock with
the exclusive drainer. This is hard to fix, but the now committed
code mitigates this issue a lot better than the (past) CVS version.
In addition, the KA_HELD and KA_UNHELD assertions have been made mute
assertions because they are dangerous and will no longer be supported
soon.

In order to avoid namespace pollution, stack.h is split into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI. In this way, the newly added _lockmgr.h can
just include _stack.h.

The kernel ABI is heavily changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
the KPI is broken by the lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.
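
A hedged usage sketch of the new interlock mappers (assuming the usual
LK_INTERLOCK semantics carry over to the rwlock interlock):

    struct rwlock ilk;
    struct lock   lk;

    rw_wlock(&ilk);
    /* ... examine state protected by the interlock ... */
    lockmgr_rw(&lk, LK_EXCLUSIVE | LK_INTERLOCK, &ilk);
    /* the interlock has been dropped; we hold lk exclusively */
    lockmgr(&lk, LK_RELEASE, NULL);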

Tested by: kris, pho, jeff, danger
Reviewed by: jeff
Sponsored by: Google, Summer of Code program 2007


# 177471 21-Mar-2008 jeff

- Add a new td flag TDF_NEEDSUSPCHK that is set whenever a thread needs
to enter thread_suspend_check().
- Set TDF_ASTPENDING along with TDF_NEEDSUSPCHK so we can move the
thread_suspend_check() to ast() rather than userret().
- Check TDF_NEEDSUSPCHK in the sleepq_catch_signals() optimization so
that we don't miss a suspend request. If this is set use the
expensive signal path.
- Set NEEDSUSPCHK when creating a new thread in thr in case the
creating thread is due to be suspended as well but has not yet been.

Reviewed by: davidxu (Authored original patch)


# 177435 20-Mar-2008 jeff

- Restore runq to manipulating threads directly by putting runq links and
rqindex back in struct thread.
- Compile kern_switch.c independently again and stop #include'ing it from
schedulers.
- Remove the ts_thread backpointers and convert most code to go from
struct thread to struct td_sched.
- Cleanup the ts_flags #define garbage that was causing us to sometimes
do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
- Export the kern.sched sysctl node in sysctl.h


# 177368 19-Mar-2008 jeff

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# 177091 12-Mar-2008 jeff

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 177085 12-Mar-2008 jeff

- Pass the priority argument from *sleep() into sleepq and down into
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.

Reviewed by: jhb, peter


# 176730 02-Mar-2008 jeff

Add cpuset, an api for thread to cpu binding and cpu resource grouping
and assignment.
- Add a reference to a struct cpuset in each thread that is inherited from
the thread that created it.
- Release the reference when the thread is destroyed.
- Add prototypes for syscalls and macros for manipulating cpusets in
sys/cpuset.h
- Add syscalls to create, get, and set new numbered cpusets:
cpuset(), cpuset_{get,set}id()
- Add syscalls for getting and setting affinity masks for cpusets or
individual threads: cpuset_{get,set}affinity()
- Add types for the 'level' and 'which' parameters for the cpuset. This
will permit expansion of the api to cover cpu masks for other objects
identifiable with an id_t integer. For example, IRQs and Jails may be
coming soon.
- The root set 0 contains all valid cpus. All threads initially belong to
cpuset 1. This permits migrating all threads off of certain cpus to
reserve them for special applications.
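
A hedged userland sketch of the affinity syscalls listed above (error
handling kept to a minimum):

    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <err.h>

    static void
    pin_to_cpu(int cpu)
    {
            cpuset_t mask;

            CPU_ZERO(&mask);
            CPU_SET(cpu, &mask);
            /* id -1 with CPU_WHICH_TID means "the current thread" */
            if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                sizeof(mask), &mask) != 0)
                    err(1, "cpuset_setaffinity");
    }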

Sponsored by: Nokia
Discussed with: arch, rwatson, brooks, davidxu, deischen
Reviewed by: antoine


# 176078 07-Feb-2008 jeff

- Add THREAD_LOCKPTR_ASSERT() to assert that the thread's lock points at
the provided lock or &blocked_lock. The thread may be temporarily
assigned to the blocked_lock by the scheduler so a direct comparison
can not always be made.
- Use THREAD_LOCKPTR_ASSERT() in the primary consumers of the scheduling
interfaces. The schedulers themselves still use more explicit asserts.

Sponsored by: Nokia


# 176017 05-Feb-2008 jeff

Adaptive spinning in write path with readers and writer starvation avoidance.
- Move recursion checking into rwlock inlines to free a bit for use with
adaptive spinners.
- Clear the RW_LOCK_WRITE_SPINNERS flag whenever the lock state changes
causing write spinners to restart their loop.
- Write spinners are limited by a count while readers hold the lock as
there is no way to know for certain whether readers are running still.
- In the read path block if there are write waiters or spinners to avoid
starving writers. Use a new per-thread count, td_rw_rlocks, to skip
starvation avoidance if it might cause a deadlock.
- Remove or change invalid assertions in turnstiles.

Reviewed by: attilio (developed parts of the patch as well)
Sponsored by: Nokia


# 175846 31-Jan-2008 mav

Move GET_STACK_USAGE from MI header to i386/amd64 MD ones.
Somebody who can, please feel free to implement it for other archs
or copy this one if it suits.


# 175835 30-Jan-2008 mav

Implement a GET_STACK_USAGE() macro to get the current kernel thread stack
usage. This implementation is made for a downward-growing stack
organization, like i386/amd64 platforms have, but a different machine
dependent version is preferred if present.
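
A minimal sketch of the idea for a downward-growing kernel stack (the
committed macro differs in detail):

    #define GET_STACK_USAGE(total, used) do {                       \
            struct thread *__td = curthread;                        \
                                                                    \
            (total) = __td->td_kstack_pages * PAGE_SIZE;            \
            /* Stack grows down: used = stack top - current SP,     \
             * approximated by the address of a local variable. */  \
            (used) = (char *)__td->td_kstack + (total) -            \
                (char *)&__td;                                      \
    } while (0)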


# 175219 10-Jan-2008 rwatson

Don't zero td_runtime when billing thread CPU usage to the process;
maintain a separate td_incruntime to hold unbilled CPU usage for
the thread that has the previous properties of td_runtime.

When thread information is requested using the thread monitoring
sysctls, export thread td_runtime instead of process rusage runtime
in kinfo_proc.

This restores the display of individual ithread and other kernel
thread CPU usage since inception in ps -H and top -SH, as well as for
libthr user threads; this valuable debugging information was lost with
the move to true kthreads since they are no longer independent processes.

There is universal agreement that we should rewrite the process and
thread export sysctls, but this commit gets things going a bit
better in the mean time. Likewise, there are reservations about the
continued validity of statclock given the speed of modern processors.

Reviewed by: attilio, emaste, jhb, julian


# 174647 16-Dec-2007 jeff

Refactor select to reduce contention and hide internal implementation
details from consumers.

- Track individual selectors on a per-descriptor basis such that there
are no longer collisions and after sleeping for events only those
descriptors which triggered events must be rescanned.
- Protect the selinfo (per descriptor) structure with a mtx pool mutex.
mtx pool mutexes were chosen to preserve api compatibility with
existing code which does nothing but bzero() to set up selinfo
structures.
- Use a per-thread wait channel rather than a global wait channel.
- Hide select implementation details in a seltd structure which is
opaque to the rest of the kernel.
- Provide a 'selsocket' interface for those kernel consumers who wish to
select on a socket when they have no fd so they no longer have to
be aware of select implementation details.

Tested by: kris
Reviewed on: arch


# 174629 15-Dec-2007 jeff

- Re-implement lock profiling in such a way that it no longer breaks
the ABI when enabled. There is no longer an embedded lock_profile_object
in each lock. Instead a list of lock_profile_objects is kept per-thread
for each lock it may own. The cnt_hold statistic is now always 0 to
facilitate this.
- Support shared locking by tracking individual lock instances and
statistics in the per-thread per-instance lock_profile_object.
- Make the lock profiling hash table a per-cpu singly linked list with a
per-cpu static lock_prof allocator. This removes the need for an array
of spinlocks and reduces cache contention between cores.
- Use a separate hash for spinlocks and other locks so that only a
critical_enter() is required and not a spinlock_enter() to modify the
per-cpu tables.
- Count time spent spinning in the lock statistics.
- Remove the LOCK_PROFILE_SHARED option as it is always supported now.
- Specifically drop and release the scheduler locks in both schedulers
since we track owners now.

In collaboration with: Kip Macy
Sponsored by: Nokia


# 174253 04-Dec-2007 kib

Implement fetching of the __FreeBSD_version from the ELF ABI-tag note.
The value is read into the p_osrel member of the struct proc. p_osrel
is set to 0 for the binaries without the note.

MFC after: 3 days


# 173615 14-Nov-2007 marcel

o Rename cpu_thread_setup() to cpu_thread_alloc() to better
communicate that it relates to (is called by) thread_alloc()
o Add cpu_thread_free() which is called from thread_free()
to counter-act cpu_thread_alloc().

i386: Have cpu_thread_free() call cpu_thread_clean() to
preserve behaviour.
ia64: Have cpu_thread_free() call mtx_destroy() for the
mutex initialized in cpu_thread_alloc().

PR: ia64/118024


# 173594 14-Nov-2007 jkoshy

Reserve a bit for use when capturing callchains.


# 173361 05-Nov-2007 kib

Fix for the panic("vm_thread_new: kstack allocation failed") and
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when the kmem_alloc_nofault() failed to allocate address space. Both
functions now return error instead of panicing or dereferencing NULL.

As a consequence, vmspace_exec() and vmspace_unshare() now return an
errno int. A struct vmspace arg was added to vm_forkproc() to avoid
dealing with failed allocation when most of the fork1() job is already
done.

The kernel stack for the thread is now set up in thread_alloc(),
which itself may return NULL. Also, allocation of the first process
thread is performed in the fork1() to properly deal with stack
allocation failure. proc_linkup() is separated into proc_linkup()
called from fork1(), and proc_linkup0(), that is used to set up the
kernel process (was known as swapper).

In collaboration with: Peter Holm
Reviewed by: jhb


# 173004 26-Oct-2007 julian

Introduce a way to make pure kernel threads.
kthread_add() takes the same parameters as the old kthread_create()
plus a pointer to a process structure, and adds a kernel thread
to that process.

kproc_kthread_add() takes the parameters for kthread_add,
plus a process name and a pointer to a pointer to a process instead of just
a pointer, and if the proc * is NULL, it creates the process to the
specifications required, before adding the thread to it.

All other old kthread_xxx() calls remain, but act on (struct thread *)
instead of (struct proc *). One reason to change the name is so that
any old kernel modules that are lying around and expect kthread_create()
to make a process will not just accidentally link.

Fix top to show kernel threads by their thread name in -SH mode.
Add a tdnam formatting option to ps to show thread names.

Make all idle threads actual kthreads and put them into their own idled
process.
Make all interrupt threads kthreads and put them in an interd process
(mainly for aesthetic and accounting reasons).
Rename proc 0 to be 'kernel'; its swapper thread is now 'swapper'.

man page fixes to follow.


# 172264 21-Sep-2007 jeff

- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.
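
For example, a consumer that previously read p_swtime in seconds would
now derive it roughly as follows (sketch):

    /* seconds since the process was last swapped in/out */
    int swtime_secs = (ticks - p->p_swtick) / hz;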

Approved by: re


# 172207 17-Sep-2007 jeff

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


# 171611 27-Jul-2007 attilio

Actually, upcalls cannot be freed while destroying the thread because we
would have to call uma_zfree() with various spinlocks held. Rearranging
the code would not help here because we cannot break atomicity with
respect to the process spinlock, so the only choice we have is to defer
the operation. In order to do this, use a global queue synchronized
through the kse_lock spinlock, which is emptied at any thread_alloc() /
thread_wait() through a call to thread_reap().
Note that this approach is not ideal, as we would want a per-process
list of zombie upcalls, but it follows the initial guidelines of the
KSE authors.

Tested by: jkim, pav
Approved by: jeff, julian
Approved by: re


# 171556 23-Jul-2007 attilio

Actually, the locking of the KSE kernel bits is broken and can likely
lead to dangerous races.
Fix these problems by adding correct locking for the members of 'struct
kse_upcall' and other related struct proc / struct thread members.
For the moment, just leave ku_mflag and ku_flags "lazily" locked.
While here, clean up the code by removing the unused function kse_GC()
and merging upcall_link(), upcall_unlink() and upcall_stash() into their
respective callers (static functions, very short and only called in one
place).

Reported by: pav
Tested by: pav (on some pointyhat cluster nodes)
Approved by: jeff
Approved by: re
Sponsored by: NGX Italy (http://www.ngx.it)


# 171549 22-Jul-2007 attilio

The preprocessing stub "KSE" breaks the ABI for both modules and
userspace consumers.
This patch makes KSE no longer an optional stub for kernel structures,
fixing the breakage.
As a side note, this bug has broken kqemu for a long period now.

Tested by: Ulf Lilleengen <lulf@FreeBSD.org>
Discussed with: rwatson, jeff
Approved by: jeff (mentor)
Approved by: re


# 170633 12-Jun-2007 jeff

- Fix kse by moving the upcalls list back out of the zero'd section.
I had tested this with the wrong libpthread.


# 170630 12-Jun-2007 jeff

- Garbage collect unused concurrency functions.
- Remove unused kse fields from struct proc.
- Group remaining fields and #ifdef KSE them.
- Move some kern_kse.c only prototypes out of proc and into kern_kse.

Discussed with: Julian


# 170598 12-Jun-2007 jeff

Solve a complex exit race introduced with thread_lock:
- Add a count of exiting threads, p_exitthreads, to struct proc.
- Increment p_exitthreads when we set the deadthread in thread_exit().
- When we thread_stash() a deadthread use an atomic to drop the count.
- Spin until the p_exitthreads count reaches 0 in thread_wait().
- Lock the last exiting thread momentarily to be certain that it has
exited cpu_throw().
- Restructure thread_wait(). It does not need a loop as there will only
ever be one thread.

Tested by: moose@opera.com
Reported by: kris, moose@opera.com


# 170584 11-Jun-2007 jeff

- Move p_ru to the zero'd section of the proc to keep stats accurate.


# 170466 09-Jun-2007 attilio

The current rusage code shows peculiar problems:
- Unsafeness of ruadd() in thread_exit()
- Non-atomicity of thread_exit() in the exit1() operations

This patch addresses these problems by allocating p_ru as part of the
process and modifying the way it is accessed.

A small chunk of this patch resolves a race on p_state in kern_wait(),
since we have to be sure about the zombifying process.

Submitted by: jeff
Approved by: jeff (mentor)


# 170407 07-Jun-2007 rwatson

Move per-process audit state from a pointer in the proc structure to
embedded storage in struct ucred. This allows audit state to be cached
with the thread, avoiding locking operations with each system call, and
makes it available in asynchronous execution contexts, such as deep in
the network stack or VFS.

Reviewed by: csjp
Approved by: re (kensmith)
Obtained from: TrustedBSD Project


# 170358 06-Jun-2007 jeff

- Placing the 'volatile' on the right side of the * in the td_lock
declaration removes the need for __DEVOLATILE().

Pointed out by: tegge


# 170293 04-Jun-2007 jeff

Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.
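
The resulting usage pattern for thread scheduling fields looks roughly
like this (sketch):

    thread_lock(td);
    td->td_flags |= TDF_NEEDRESCHED;  /* scheduling fields need the thread lock */
    thread_unlock(td);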

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 170174 31-May-2007 jeff

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock, which is going away. All modifications to rusage are now
done in the context of the owning thread. Reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch(), to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to its exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# 169006 24-Apr-2007 kib

Disable nesting of BOP_BDFLUSH(). The VOP_FSYNC() call in bdwrite()
could result in bdwrite() being reentered, thus causing infinite
recursion.

Reported and tested by: Peter Holm
Reviewed by: tegge
MFC after: 2 weeks


# 168756 15-Apr-2007 des

Add macros to assert that the process is / isn't held in memory.

MFC after: 3 weeks


# 168688 13-Apr-2007 csjp

Fix the handling of IPv6 addresses for subject and process BSM audit
tokens. Currently, we do not support the set{get}audit_addr(2) system
calls, which allow processes like sshd to set extended or IPv6
information for subject tokens.

The approach that was taken was to change the process audit state
slightly to use an extended terminal ID in the kernel. This allows
us to store both IPv4 and IPv6 addresses. In the case that an IPv4
address is in use, we convert the terminal ID from a struct
auditinfo_addr to a struct auditinfo.

If getaudit(2) is called when the subject is bound to an ip6 address,
we return E2BIG.

- Change the internal audit record to store an extended terminal ID
- Introduce ARG_TERMID_ADDR
- Change the kaudit <-> BSM conversion process so that we are using
the appropriate subject token. If the address associated with the
subject is IPv4, we use the standard subject32 token. If the subject
has an IPv6 address associated with them, we use an extended subject32
token.
- Fix a couple of endian issues where we do a couple of byte swaps when
we shouldn't be. IP addresses are already in the correct byte order,
so reading the ip6 address 4 bytes at a time and swapping them results
in incorrect address data. It should be noted that the same issue was
found in the openbsm library and it has been changed there too on the
vendor branch.
- Change A_GETPINFO to use the appropriate structures
- Implement A_GETPINFO_ADDR which basically does what A_GETPINFO does,
but can also handle ip6 addresses
- Adjust get{set}audit(2) syscalls to convert the data
auditinfo <-> auditinfo_addr
- Fully implement set{get}audit_addr(2)

NOTE: This adds the ability for processes to correctly set extended subject
information. The appropriate userspace utilities still need to be updated.

MFC after: 1 month
Reviewed by: rwatson
Obtained from: TrustedBSD


# 167787 21-Mar-2007 jhb

Rename the 'mtx_object', 'rw_object', and 'sx_object' members of mutexes,
rwlocks, and sx locks to 'lock_object'.


# 167352 09-Mar-2007 mohans

Over NFS, an open() call could result in multiple over-the-wire
GETATTRs being generated - one from lookup()/namei() and the other
from nfs_open() (for cto consistency). This change eliminates the
GETATTR in nfs_open() if an otw GETATTR was done from the namei()
path. Instead of extending the vop interface, we timestamp each attr
load, and use this to detect whether a GETATTR was done from namei()
for this syscall. Introduces a thread-local variable that counts the
syscalls made by the thread and uses <pid, tid, thread syscalls> as
the attrload timestamp. Thanks to jhb@ and peter@ for a discussion on
thread state that could be used as the timestamp with minimal overhead.


# 167327 08-Mar-2007 julian

Instead of doing comparisons using the pcpu area to see if
a thread is an idle thread, just see if it has the IDLETD
flag set. That flag will probably move to the pflags word
as it's permanent and never changes for the life of the
system, so it doesn't need locking.


# 166188 23-Jan-2007 jeff

- Remove setrunqueue and replace it with direct calls to sched_add().
setrunqueue() was mostly empty. The few asserts and thread state
setting were moved to the individual schedulers. sched_add() was
chosen to displace it for naming consistency reasons.
- Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
different on all three schedulers where it was only called in one place
each.
- Remove the long ifdef'd out remrunqueue code.
- Remove the now redundant ts_state. Inspect the thread state directly.
- Don't set TSF_* flags from kern_switch.c, we were only doing this to
support a feature in one scheduler.
- Change sched_choose() to return a thread rather than a td_sched. Also,
rely on the schedulers to return the idlethread. This simplifies the
logic in choosethread(). Aside from the run queue links kern_switch.c
mostly does not care about the contents of td_sched.

Discussed with: julian

- Move the idle thread loop into the per scheduler area. ULE wants to
do something different from the other schedulers.

Suggested by: jhb

Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.


# 165760 04-Jan-2007 jeff

- Add SRQ_BORROWING to the list of switch reasons. ULE is the only consumer
at this time. It is used to optimize the run queue placement of threads
who have newly elevated priorities.


# 165272 16-Dec-2006 kmacy

Add a second sleep queue so that sx and lockmgr can have separate sleep
queues for shared and exclusive acquisitions

Submitted by: Attilio Rao
Approved by: jhb


# 165261 15-Dec-2006 flz

"Paralleled" should have been "parceled".

Pointed out by: julian
Relayed by: rdivacky


# 165247 15-Dec-2006 pav

Fix typos in comment block

Submitted by: rdivacky


# 164936 06-Dec-2006 julian

Threading cleanup.. part 2 of several.

Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4BSD scheduler still requires a little cleaning, and some functions
that now do ALMOST nothing will go away, but I thought I'd do that as a
separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.


# 164876 04-Dec-2006 davidxu

If a thread blocked on a userland condition variable is
pthread_cancel()ed, it is expected that the thread will not
consume a pthread_cond_signal(); therefore, we use thr_wake()
to set a flag, which tells a thread calling do_cv_wait()
in the umtx code not to block on the condition variable.
The thread library is expected, once a thread detects that it
is in pthread_cond_wait, to call thr_wake() for itself
in its SIGCANCEL handler.


# 164230 12-Nov-2006 ceri

Correct typos in comments.


# 163775 29-Oct-2006 jb

Add the padding fields to 'struct proc' for the !KSE case that I missed.

Noticed by: pjd


# 163714 27-Oct-2006 davidxu

Remove member p_procscopegrp which is no longer used by libthr.


# 163709 26-Oct-2006 jb

Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by: davidxu@


# 162968 02-Oct-2006 jhb

Update description of td_locks.

MFC after: 3 days
Requested by: pjd


# 161596 25-Aug-2006 davidxu

Add the member kg_base_user_pri and the flag TDF_UBORROWING; they will
be used to support userland priority propagation for 1:1 threading.


# 158718 18-May-2006 davidxu

Move the flag TDF_UMTXQ into the umtxq structure; this eliminates the
requirement for the scheduler lock in some umtx code.


# 156680 13-Mar-2006 davidxu

Remove unused code.


# 156570 11-Mar-2006 phk

Go over calcru and friends once more.

Reintroduce the monotonicity for the normal case and make the two
special cases behave in what is believed to be the most sensible fashion.


# 156117 28-Feb-2006 jhb

Allow PHOLD()'s of curproc even if P_WEXIT is set. Normally we don't want
to allow PHOLD()'s of processes that have P_WEXIT set as once that flag
is set we aren't guaranteed to block in exit1() waiting for the PRELE()
(we might already be past the wait). However, curproc is a bit of a
special case. By the time P_WEXIT is set, the process is single-threaded,
so the only thread which can do a PHOLD(curproc) is the thread
executing in exit1(). The fact that this thread is executing ensures
that the process won't go away before the current hold is released via
PRELE(). This fixes some panics due to kicking off softupdate operations
inside of exit1() after the recent PHOLD changes to fix ptrace/procfs vs
exit races.
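
The hold pattern this protects looks roughly like the following
(sketch; callers and error handling vary):

    PROC_LOCK(p);
    _PHOLD(p);                      /* asserts !P_WEXIT, except for curproc */
    PROC_UNLOCK(p);
    error = proc_rwmem(p, &uio);    /* the vmspace is safe while held */
    PRELE(p);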

MFC after: 1 week
Tested by: pho


# 155922 22-Feb-2006 jhb

Close some races between procfs/ptrace and exit(2):
- Reorder the events in exit(2) slightly so that we trigger the S_EXIT
stop event earlier. After we have signalled that, we set P_WEXIT and
then wait for any processes with a hold on the vmspace via PHOLD to
release it. PHOLD now KASSERT()'s that P_WEXIT is clear when it is
invoked, and PRELE now does a wakeup if P_WEXIT is set and p_lock drops
to zero.
- Change proc_rwmem() to require that the process being read from has its
vmspace held via PHOLD by the caller and get rid of all the junk to
screw around with the vmspace reference count as we no longer need it.
- In ptrace() and pseudofs(), treat a process with P_WEXIT set as if it
doesn't exist.
- Only do one PHOLD in kern_ptrace() now, and do it earlier so it covers
FIX_SSTEP() (since on alpha at least this can end up calling proc_rwmem()
to clear an earlier single-step simulated via a breakpoint). We only
do one to avoid races. Also, by making the EINVAL error for unknown
requests be part of the default: case in the switch, the various
switch cases can now just break out to return which removes a _lot_ of
duplicated PRELE and proc unlocks, etc. Also, it fixes at least one bug
where a LWP ptrace command could return EINVAL with the proc lock still
held.
- Changed the locking for ptrace_single_step(), ptrace_set_pc(), and
ptrace_clear_single_step() to always be called with the proc lock
held (it was a mixed bag previously). Alpha and arm have to drop
the lock while they mess around with breakpoints, but other archs
avoid extra lock release/acquires in ptrace(). I did have to fix a
couple of other consumers in kern_kse and a few other places to
hold the proc lock and PHOLD.

Tested by: ps (1 mostly, but some bits of 2-4 as well)
MFC after: 1 week


# 155741 15-Feb-2006 davidxu

Fix a long standing race between the sleep queue and thread
suspension code. When a thread A is going to sleep, it calls
sleepq_catch_signals() to detect any pending signals or thread
suspension requests; if nothing happens, it returns without
holding the process lock or scheduler lock. This opens a race
window which allows a thread B to come in and do process
suspension work; however, since A is still in the running state,
thread B can do nothing to A. Thread A continues, and puts
itself into the actually-sleeping state, but B has never seen it,
and it sits there forever until B is woken up by other threads
sometime later (this can be a very long delay, or never
happen). Fix this bug by forcing sleepq_catch_signals to
return with the scheduler lock held.
Fix sleepq_abort() by passing it an interruption code; previously
it worked like wakeup_one(), and the interruption could not be
identified correctly by the sleep queue code when the sleeping
thread was resumed.
Let thread_suspend_check() return EINTR or ERESTART, so the sleep
queue no longer has to use SIGSTOP as a hack to build a return
value.

Reviewed by: jhb
MFC after: 1 week


# 155534 11-Feb-2006 phk

CPU time accounting speedup (step 2)

Keep accounting time (in per-cpu cputicks) and the statistics counts
in the thread, and summarize into struct proc at context switch.

Don't reach across CPUs in calcru().

Add code to calibrate the top speed of cpu_tickrate() for variable
cpu_tick hardware (like TSC on power managed machines).

Don't enforce monotonicity (at least for now) in calcru. While the
calibrated cpu_tickrate ramps up it may not be true.

Use 27MHz counter on i386/Geode.

Use TSC on amd64 & i386 if present.

Use tick counter on sparc64


# 155455 08-Feb-2006 phk

Simplify system time accounting for profiling.

Rename struct thread's td_sticks to td_pticks; we will need the
other name shortly for a more appropriately named use. Reduce it
from uint64_t to u_int.

Clear td_pticks whenever we enter the kernel instead of recording
its value as a reference for userret(). Use the absolute value of
td->td_pticks in userret() and eliminate the third argument.


# 155444 07-Feb-2006 phk

Modify the way we account for CPU time spent (step 1)

Keep track of time spent by the cpu in various contexts in units of
"cputicks" and scale to real-world microsec^H^H^H^H^H^H^H^Hclock_t
only when somebody wants to inspect the numbers.

For now "cputicks" are still derived from the current timecounter
and therefore things should by definition remain sensible also on
SMP machines. (The main reason for this first milestone commit is
to verify that hypothesis.)

On slower machines, avoiding the multiplications to normalize timestamps
at every context switch comes out as a 5-7% better score on the
unixbench/context1 microbenchmark. On more modern hardware no change
in performance is seen.


# 155195 01-Feb-2006 rwatson

Add new fields to process-related data structures:

- td_ar to struct thread, which holds the in-progress audit record during
a system call.

- p_au to struct proc, which holds per-process audit state, such as the
audit identifier, audit terminal, and process audit masks.

In the earlier implementation, td_ar was added to the zero'd section of
struct thread. In order to facilitate merging to RELENG_6, it has been
moved to the end of the data structure, requiring explicit
initialization in the thread constructor.

Much help from: wsalamon
Obtained from: TrustedBSD Project


# 154938 27-Jan-2006 jhb

Oops, commit missed file from the previous change to enable multiple
queues in turnstiles. Add a new thread member td_tsqueue which contains
the sub-queue of a turnstile that a thread is on when it is blocked on a
turnstile.


# 154534 18-Jan-2006 julian

Duh! Put the thread name into the section that is zero'd on allocation
(by default there is no name).


# 154533 18-Jan-2006 julian

Congratulations, we now have a place for a thread to store its name.
Will be needed when we start using real threads in the kernel instead
of whole pseudo-processes.


# 154481 17-Jan-2006 jhb

Update a stale comment.


# 152948 30-Nov-2005 davidxu

Last step to make mq_notify conform to the POSIX standard: if the process
has successfully attached a notification request to the message queue
via a queue descriptor, closing the file should remove the attachment.


# 152376 13-Nov-2005 rwatson

Moderate rewrite of kernel ktrace code to attempt to generally improve
reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queueing and dropped records.
Record loss was previously possible when the global pool of records
became depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.

These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler). Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.

- Eliminate the ktrace worker thread and global record queue, as they
are no longer used. Keep the global free record list, as records
are still used.

- Add a per-process record queue, which will hold any asynchronously
generated records, such as from context switches. This replaces the
global queue as the place to submit asynchronous records to.

- When a record is committed asynchronously, simply queue it to the
process.

- When a record is committed synchronously, first drain any pending
per-process records in order to maintain ordering as best we can.
Currently ordering between competing threads is provided via a global
ktrace_sx, but a per-process flag or lock may be desirable in the
future.

- When a process returns to user space following a system call, trap,
signal delivery, etc, flush any pending records.

- When a process exits, flush any pending records.

- Assert on process tear-down that there are no pending records.

- Slightly abstract the notion of being "in ktrace", which is used to
prevent the recursive generation of records, as well as generating
traces for ktrace events.

Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on. I.e., performing a
drain every (n) records.

MFC after: 1 month
Discussed with: jhb
Requested by: Marc Olzheim <marcolz at stack dot nl>


# 152185 08-Nov-2005 davidxu

Add support for queueing SIGCHLD same as other UNIX systems did.

For each child process whose status has been changed, a SIGCHLD instance
is queued; if the signal is still pending and the process changed status
several times, the signal information is updated to reflect the latest
process status. If wait() returns because the status of a child process
is available, the pending SIGCHLD signal associated with that child
process is discarded. Any other pending SIGCHLD signals remain pending.

The signal information is allocated at the same time the proc structure
is allocated, so even if the process signal queue is completely full or
there is a memory shortage, the signal can still be sent to the process.

There is a boot-time tunable kern.sigqueue.queue_sigchild which
can control the behavior; setting it to zero disables the SIGCHLD
queueing feature. The tunable will be removed once the feature is
proven stable enough.

Tested on: i386 (SMP and UP)


# 151990 02-Nov-2005 davidxu

Add thread_find() function to search a thread by lwpid.


# 151658 25-Oct-2005 jhb

Reorganize the interrupt handling code a bit to make a few things cleaner
and increase flexibility to allow various different approaches to be tried
in the future.
- Split struct ithd up into two pieces. struct intr_event holds the list
of interrupt handlers associated with interrupt sources.
struct intr_thread contains the data relative to an interrupt thread.
Currently we still provide a 1:1 relationship of events to threads
with the exception that events only have an associated thread if there
is at least one threaded interrupt handler attached to the event. This
means that on x86 we no longer have 4 bazillion interrupt threads with
no handlers. It also means that interrupt events with only INTR_FAST
handlers no longer have an associated thread either.
- Renamed struct intrhand to struct intr_handler to follow the struct
intr_foo naming convention. This did require renaming the powerpc
MD struct intr_handler to struct ppc_intr_handler.
- INTR_FAST no longer implies INTR_EXCL on all architectures except for
powerpc. This means that multiple INTR_FAST handlers can attach to the
same interrupt and that INTR_FAST and non-INTR_FAST handlers can attach
to the same interrupt. Sharing INTR_FAST handlers may not always be
desirable, but having sio(4) and uhci(4) fight over an IRQ isn't fun
either. Drivers can always still use INTR_EXCL to ask for an interrupt
exclusively. The way this sharing works is that when an interrupt
comes in, all the INTR_FAST handlers are executed first, and if any
threaded handlers exist, the interrupt thread is scheduled afterwards.
This type of layout also makes it possible to investigate using interrupt
filters ala OS X where the filter determines whether or not its companion
threaded handler should run.
- Aside from the INTR_FAST changes above, the impact on MD interrupt code
is mostly just 's/ithread/intr_event/'.
- A new MI ddb command 'show intrs' walks the list of interrupt events
dumping their state. It also has a '/v' verbose switch which dumps
info about all of the handlers attached to each event.
- We currently don't destroy an interrupt thread when the last threaded
handler is removed because it would suck for things like ppbus(8)'s
braindead behavior. The code is present, though, it is just under
#if 0 for now.
- Move the code to actually execute the threaded handlers for an interrupt
event into a separate function so that ithread_loop() becomes more
readable. Previously this code was all in the middle of ithread_loop()
and indented halfway across the screen.
- Made struct intr_thread private to kern_intr.c and replaced td_ithd
with a thread private flag TDP_ITHREAD.
- In statclock, check curthread against idlethread directly rather than
curthread's proc against idlethread's proc. (Not really related to intr
changes)

Tested on: alpha, amd64, i386, sparc64
Tested on: arm, ia64 (older version of patch by cognet and marcel)


# 151585 23-Oct-2005 davidxu

Make p_itimers a pointer, so sys/proc.h does not need to include
sys/timers.h.


# 151576 23-Oct-2005 davidxu

Implement POSIX timers. Currently only the CLOCK_REALTIME and
CLOCK_MONOTONIC clocks are supported. I plan to merge the XSI timer
ITIMER_REAL and the other two CPU timers into the new code; currently
three slots are reserved for the XSI timers.
The SIGEV_THREAD notification type is not supported yet because our
sigevent struct lacks two member fields:
sigev_notify_function
sigev_notify_attributes
I have found that sigevent is used in AIO, so I won't add the two members
unless the AIO code is adjusted.
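
A hedged userland sketch using a supported clock and SIGEV_SIGNAL
(since SIGEV_THREAD is not supported yet, per the above):

    #include <err.h>
    #include <signal.h>
    #include <time.h>

    static timer_t
    make_timer(void)
    {
            struct sigevent sev;
            timer_t tid;

            sev.sigev_notify = SIGEV_SIGNAL;
            sev.sigev_signo = SIGUSR1;
            sev.sigev_value.sival_int = 0;
            if (timer_create(CLOCK_MONOTONIC, &sev, &tid) == -1)
                    err(1, "timer_create");
            return (tid);
    }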


# 151316 14-Oct-2005 davidxu

1. Change prototype of trapsignal and sendsig to use ksiginfo_t *, most
changes in MD code are trivial, before this change, trapsignal and
sendsig use discrete parameters, now they uses member fields of
ksiginfo_t structure. For sendsig, this change allows us to pass
POSIX realtime signal value to user code.

2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.

3. Add p_sigqueue to the proc structure to hold shared signals which were
blocked by all threads in the proc.

4. Add td_sigqueue to the thread structure to hold all signals delivered
to the thread.

5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.

6. In this sigqueue implementation, the pending signal set is kept as
before; an extra siginfo list holds additional siginfo_t data for
signals. Kernel code using psignal() still behaves as before; it won't
fail even under memory pressure. The only exception is that when
deleting a signal, we should call sigqueue_delete to remove the signal
from the sigqueue rather than SIGDELSET. Currently there is no kernel
code that delivers a signal with additional data, so the kernel should
be as stable as before; a ksiginfo can carry more information, and can,
for example, allow a signal to be delivered but throw away the siginfo
data if memory is insufficient.
SIGKILL and SIGSTOP have a fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to a target
process; if resources are unavailable, EAGAIN will be returned as the
specification says.
Just before a thread exits, signal queue memory will be freed by
sigqueue_flush.
Currently, all signals are allowed to be queued, not only realtime
signals.
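
A hedged userland sketch of the sigqueue() syscall described above:

    #include <err.h>
    #include <signal.h>

    static void
    queue_value(pid_t pid, int val)
    {
            union sigval sv;

            sv.sival_int = val;
            if (sigqueue(pid, SIGUSR1, sv) == -1)
                    err(1, "sigqueue");   /* EAGAIN when resources run out */
    }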

Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64


# 150741 29-Sep-2005 truckman

Un-staticize runningbufwakeup() and staticize updateproc.

Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than
checking the proc pointer vs. the known proc pointers for these two
threads. A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk. This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk. The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.

See <http://www.holm.cc/stress/log/cons143.html>


# 150630 27-Sep-2005 jhb

Use the refcount API to implement reference counts on process argument
structures rather than using a global mutex to protect the reference
counts.
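
The resulting pattern is roughly the following (sketch; I am assuming
the ar_ref field of struct pargs and the pargs_hold()/pargs_drop()
consumers in kern_proc.c):

    #include <sys/refcount.h>

    refcount_init(&pa->ar_ref, 1);       /* at creation */
    refcount_acquire(&pa->ar_ref);       /* pargs_hold() */
    if (refcount_release(&pa->ar_ref))   /* pargs_drop() */
            free(pa, M_PARGS);           /* last reference gone */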

Tested on: i386, alpha, sparc64


# 150177 15-Sep-2005 jhb

- Add a new simple facility for marking the current thread as being in a
state where sleeping on a sleep queue is not allowed. The facility
doesn't support recursion but uses a simple private per-thread flag
(TDP_NOSLEEPING). The sleepq_add() function will panic if the flag is
set and INVARIANTS is enabled.
- Use this new facility to replace the g_xup and g_xdown mutexes that were
(ab)used to achieve similar behavior.
- Disallow sleeping in interrupt threads when invoking interrupt handlers.
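
The facility is used roughly like this (sketch; note the flag is not
recursion-safe, per the above):

    curthread->td_pflags |= TDP_NOSLEEPING;
    /* ... code that must not sleep: sleepq_add() will panic here
     * if INVARIANTS is enabled ... */
    curthread->td_pflags &= ~TDP_NOSLEEPING;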

MFC after: 1 week
Reviewed by: phk


# 148920 10-Aug-2005 obrien

Remove public declarations of variables that were forgotten when they were
made static.


# 147889 10-Jul-2005 davidxu

Validate that the value written into {FS,GS}.base is a canonical
address; writing a non-canonical address can cause a kernel panic.
This is done by restricting base values to 0..VM_MAXUSER_ADDRESS,
ensuring only canonical values get written to the registers.

Reviewed by: peter, Josepha Koshy < joseph.koshy at gmail dot com >
Approved by: re (scottl)


# 146813 30-May-2005 alc

Use the proc mtx to prevent simultaneous changes to p_aioinfo.


# 146687 27-May-2005 davidxu

Remove thread_upcall_check; it was used to avoid a race bug in the
early sleep queue code, but today the bug no longer exists.
Please see the 04/25/2004 freebsd-threads@ mailing list archive.


# 146554 23-May-2005 ups

Use low level constructs borrowed from interrupt threads to wait for
work in proc0.
Remove the TDP_WAKEPROC0 workaround.


# 145433 23-Apr-2005 davidxu

Change cpu_set_kse_upcall to a more generic style, so we can reuse it
in other code. Add cpu_set_user_tls; use it to tweak user registers
and set up user TLS. I originally wanted to merge it into
cpu_set_kse_upcall, but since cpu_set_kse_upcall is also used by M:N
threads, which may not need this feature, I wrote a separate
cpu_set_user_tls.


# 145260 19-Apr-2005 davidxu

Fix a race condition between kern_wait() and thread_stopped().
The problem is in kern_wait(): the parent process steps through the
children list, and once a child process is skipped, even if the child
later stops, the parent process still sleeps in msleep(). The race
happens if the parent has masked SIGCHLD.

Submitted by : Peter Edwards peadar.edwards at gmail dot com
MFC after : 4 days


# 145256 19-Apr-2005 jkoshy

Bring a working snapshot of hwpmc(4), its associated libraries, userland utilities
and documentation into -CURRENT.

Bump FreeBSD_version.

Reviewed by: alc, jhb (kernel changes)


# 145234 18-Apr-2005 rwatson

Introduce p_canwait() and MAC Framework and MAC Policy entry points
mac_check_proc_wait(), which control the ability to wait4() specific
processes. This permits MAC policies to limit information flow from
children that have changed label, although this has to be handled carefully
due to common programming expectations regarding the behavior of
wait4(). The cr_seeotheruids() check in p_canwait() is #if 0'd for
this reason.

The mac_stub and mac_test policies are updated to reflect these new
entry points.

Sponsored by: SPAWAR, SPARTA
Obtained from: TrustedBSD Project


# 144777 08-Apr-2005 ups

Sprinkle some volatile magic and rearrange things a bit to avoid race
conditions in critical_exit now that it no longer blocks interrupts.

Reviewed by: jhb


# 144444 31-Mar-2005 jhb

Bring back the WITNESS_WARN() check to _STOPEVENT() as all the callers have
been fixed for quite a while now.


# 144118 25-Mar-2005 das

When the softupdates worklist gets too long, threads that attempt to
add more work are forced to process two worklist items first.
However, processing an item may generate additional work, causing the
unlucky thread to recursively process the worklist. Add a per-thread
flag to detect this situation and avoid the recursion. This should
fix the stack overflows that could occur while removing large
directory trees.

Tested by: kris
Reviewed by: mckusick


# 143740 17-Mar-2005 phk

In strange circumstances we may end up being the last reference to a
session in tprintf(). SESSRELE() needs to properly dispose of the
session's mutex.

Add sessrele() which does the proper cleanup and have SESSRELE() call it.

Use SESSRELE also in pgdelete().

Found by: Coverity (ID:526)


# 143149 05-Mar-2005 davidxu

Allocate umtx_q from the heap instead of the stack; this avoids a
page fault panic in the kernel under heavy swapping.


# 143144 04-Mar-2005 davidxu

td_waitset pointed to a stack address while a thread was waiting for a
signal; because the kernel stack is swappable, this caused a page fault
in the kernel under heavy swapping. Fix this bug by eliminating the
unneeded code.


# 141815 13-Feb-2005 sobomax

Backout previous change (disabling of security checks for signals delivered
in emulation layers), since it appears to be too broad.

Requested by: rwatson


# 141812 13-Feb-2005 sobomax

Split the kill(2) syscall service routine into a user-level and a kernel
part; the former is callable from user space and the latter from the
kernel. Make the kernel version take an additional argument which tells
whether the respective call should check for additional restrictions on
sending signals to suid/sugid applications.

Make all emulation layers use the non-checked version, since signal
numbers in emulation layers can have different meanings than in native
mode and such protection can cause misbehaviour.

As a result remove LIBTHR from the signals allowed to be delivered to a
suid/sugid application.

Requested (sorta) by: rwatson
MFC after: 2 weeks
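
A hypothetical sketch of the split; the names and signatures are
illustrative, not the committed interface:

    #include <sys/types.h>  /* pid_t */

    struct thread;

    /*
     * Sketch: the kernel-side helper takes an extra flag saying whether
     * the suid/sugid signal restrictions should be enforced.
     */
    int kern_kill_sketch(struct thread *td, pid_t pid, int signum,
        int checked);

    int
    kill_sketch(struct thread *td, pid_t pid, int signum)
    {
            /* The native syscall keeps the full security checks. */
            return (kern_kill_sketch(td, pid, signum, 1));
    }

    /* Emulation layers would pass 0 for the last argument instead. */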


# 139786 06-Jan-2005 jhb

Bit 0 of td_flags is now used by the priority borrowing flag, so remove
the unused placeholder constant.


# 139453 30-Dec-2004 jhb

Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides whether the thread
can go back to normal mode (if its normal priority is high enough to
satisfy the pending lock requests) or if it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another thread's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.

Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.

Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
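
A sketch of the lending API as this entry describes it; the exact
prototypes are assumed from the text:

    #include <sys/types.h>  /* u_char */

    struct thread;

    /* Lend a fixed priority to the lock owner. */
    void sched_lend_prio(struct thread *td, u_char prio);
    /*
     * End the loan, naming the minimum priority still required by the
     * other threads blocked on locks the owner holds; the scheduler
     * either reverts to the thread's normal priority or keeps this one.
     */
    void sched_unlend_prio(struct thread *td, u_char prio);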


# 139013 18-Dec-2004 davidxu

1. Make umtx sharable between processes: two or more processes call
mmap() to create a shared space and then initialize a umtx on it;
after that, each thread in the different processes can use the umtx
the same as threads in the same process.
2. Introduce a new syscall, _umtx_op, to support timed lock and condition
variable semantics. Also, the original umtx_lock and umtx_unlock inline
functions are now reimplemented using _umtx_op; _umtx_op can use an
arbitrary id, not just a thread id.
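
A rough userland sketch of the sharing pattern in point 1 (error
handling elided); the struct layout and the lock/unlock calls named in
the comments stand in for the real umtx interfaces, which this entry
only names:

    #include <sys/mman.h>
    #include <unistd.h>

    struct umtx_sketch { volatile long u_owner; };  /* assumed layout */

    int
    main(void)
    {
            struct umtx_sketch *u;

            /* A MAP_SHARED|MAP_ANON mapping survives fork(), so parent
             * and child see the same word and contest the same umtx. */
            u = mmap(NULL, sizeof(*u), PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANON, -1, 0);
            u->u_owner = 0;                /* initialize before forking */
            if (fork() == 0) {
                    /* child: umtx_lock(u, id) / umtx_unlock(u, id) here */
                    _exit(0);
            }
            return (0);
    }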


# 138843 14-Dec-2004 jeff

- Garbage collect several unused members of struct kse and struct ksegrp.
As best as I can tell, some of these were never used.


# 137924 20-Nov-2004 das

Remove the p_uarea and p_upages_obj fields from struct proc.
Prototype new routines to allocate, copy, and free pstats.

Reviewed by: arch@


# 136837 23-Oct-2004 phk

Add a new per-thread private flag: TDP_GEOM.

This flag gets set whenever the thread posts an event on the GEOM
event queue, and if the flag is set when the thread is prepared
to return to userland from the kernel, g_waitidle() will be called
to make sure that the posted events have completed.

This can replace an insufficient number of g_waitidle() calls in
various other places, and has the advantage of being failsafe: any
system call which does a VOP_OPEN()/VOP_CLOSE() will now correctly
wait for any geom events it posted as part of spoils or tastes.

Assert that the topology lock and Giant are not held in g_waitidle().
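
A simplified sketch of the flag's life cycle as described above; the
helper shapes are illustrative, not the committed code:

    /* Posting side: mark the thread as having queued a GEOM event. */
    static void
    geom_post_sketch(struct thread *td)
    {
            /* ...queue the event on the GEOM event queue... */
            td->td_pflags |= TDP_GEOM;
    }

    /* Return-to-userland side: settle any events we posted. */
    static void
    userret_sketch(struct thread *td)
    {
            if (td->td_pflags & TDP_GEOM) {
                    td->td_pflags &= ~TDP_GEOM;
                    g_waitidle();   /* block until posted events complete */
            }
    }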


# 136583 16-Oct-2004 scottl

If a process needs to be swapped in, wake up the swapper from within
critical_exit as the process is getting scheduled to run. This is
suboptimal, but for now it avoids the LOR between the scheduler and the
sleepq systems. This is a 5.3 candidate.

Submitted by: davidxu
MFC after: 3 days


# 136177 05-Oct-2004 davidxu

In the original kern_execve() code, at the start of the function, it
forced all other threads to suicide; the problem is that execve() could
fail, and a failed execve() would change a threaded process to an
unthreaded one. This side effect is unexpected.
The new code introduces a new single threading mode, SINGLE_BOUNDARY; in
this mode, all threads should suspend themselves at the user boundary
except the thread requesting the single threading. We cannot use
SINGLE_NO_EXIT because we want to start from a clean state if execve()
is successful; suspending other threads at an unknown point, later
resuming them from there, and forcing them to exit at the user boundary
may cause the process to start from a dirty state. If execve() is
successful, the current thread upgrades to SINGLE_EXIT mode and forces
the other threads to suicide at the user boundary; otherwise, the other
threads are resumed and their interrupted syscalls restarted.

Reviewed by: julian


# 136170 05-Oct-2004 julian

When preempting a thread, put it back on the HEAD of its run queue.
(Only really implemented in 4bsd)

MFC after: 4 days


# 136160 05-Oct-2004 julian

Break out to a separate function, the code to revert a multithreaded
process back to officially being a non-threaded program.

MFC after: 4 days


# 136152 05-Oct-2004 jhb

Rework how we store process times in the kernel such that we always store
the raw values including for child process statistics and only compute the
system and user timevals on demand.

- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.

Submitted by: bde (mostly)
MFC after: 1 month
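
A sketch of the new structure; the runtime field and the tick counters
are named in the text above, while the exact types shown are
assumptions:

    /*
     * Sketch of rusage_ext: p_rux holds the process' own raw stats and
     * p_crux the accumulated raw stats of reaped children.  calcru1()
     * turns the ticks into user/system timevals on demand.
     */
    struct rusage_ext_sketch {
            struct bintime rux_runtime;  /* total time spent running */
            uint64_t       rux_uticks;   /* statclock hits in user mode */
            uint64_t       rux_sticks;   /* statclock hits in system mode */
            uint64_t       rux_iticks;   /* statclock hits in intr mode */
            uint64_t       rux_uu;       /* computed user time, usec */
            uint64_t       rux_su;       /* computed system time, usec */
            uint64_t       rux_iu;       /* computed intr time, usec */
    };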


# 135960 30-Sep-2004 alfred

Forward declare struct kaioinfo to un-void a pointer in struct proc.


# 135688 23-Sep-2004 jhb

A modest collection of various and sundry style, spelling, and whitespace
fixes.

Submitted by: bde (mostly)


# 135636 23-Sep-2004 jhb

Update locking notes on several fields to reflect locking already in the
tree:
- td_standin is (k + a) as it is only touched by either curthread or when
a thread is being created.
- td_upcall is (k + j)
- td_sticks is (k) rather than the earlier (j) note.
- td_uuticks and td_usticks are both (k).
- td_intrval is (j)
- Neither kg_nextupcall nor kg_upquantum seem to be locked and that seems
to be on purpose, so mark those as (n).


# 135170 13-Sep-2004 jhb

Tidy a comment.


# 135076 11-Sep-2004 scottl

Revert the previous round of changes to td_pinned. The scheduler isn't
fully initialized when the pmap layer tries to call sched_pin() early in
the boot, and this results in a quick panic. Use ke_pinned instead as
was originally done with Tor's patch.

Approved by: julian


# 135056 10-Sep-2004 julian

Make up my mind whether cpu pinning is stored in the thread structure or
the scheduler-specific extension to it. Put it in the extension, as the
implementation details of how the pinning is done needn't be visible
outside the scheduler.

Submitted by: tegge (of course!) (with changes)
MFC after: 3 days


# 134791 05-Sep-2004 julian

Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies the scheduler that no more than N threads from that ksegrp
should be allowed to be concurrently scheduled. This is also
used to enforce 'fairness' at this time so that a ksegrp with
10000 threads can not swamp the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventually
develop its own methods to do this now that they are effectively
separated.

Rejig libthr's kernel interface to follow the same code paths as
libkse for scope system threads. This has slightly hurt libthr's
performance but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week


# 134586 01-Sep-2004 julian

Give setrunqueue() and sched_add() more of a clue as to
where they are coming from and what is expected from them.

MFC after: 2 days


# 134583 31-Aug-2004 julian

Note the free bit, as per the flags word.

MFC after: 2 days


# 134572 31-Aug-2004 davidxu

Remove TDP_USTATCLOCK; we no longer need it because we now always
update the tick count for userland in thread_userret. This change
also removes a "no upcall owned" panic: fuword() could schedule an
upcall under heavy load, while the code assumed no upcall could
occur.

Reported and Tested by: Peter Holm <peter@holm.cc>


# 134571 31-Aug-2004 julian

Remove an unneeded argument.
The removed argument could trivially be derived from the remaining one.
That in turn should be the same as curthread, but it is possible that
curthread could be expensive to derive on some systems, so leave it as
an argument. Having both proc and thread as arguments just gives an
opportunity for them to get out of sync.

MFC after: 3 days


# 134425 28-Aug-2004 davidxu

Move TDF_CAN_UNBIND to the thread private flags td_pflags; this
eliminates the need for sched_lock in some places. Also, in
thread_userret, remove the spare thread allocation code; it is already
done in thread_user_enter.

Reviewed by: julian


# 134013 19-Aug-2004 jhb

Now that the return value semantics of cv's for multithreaded processes
have been unified with that of msleep(9), further refine the sleepq
interface and consolidate some duplicated code:
- Move the pre-sleep checks for threaded processes into a
thread_sleep_check() function in kern_thread.c.
- Move all handling of TDF_SINTR to be internal to subr_sleepqueue.c.
Specifically, if a thread is awakened by something other than a signal
while checking for signals before going to sleep, clear TDF_SINTR in
sleepq_catch_signals(). This removes a sched_lock lock/unlock combo in
that edge case during an interruptible sleep. Also, fix
sleepq_check_signals() to properly handle the condition if TDF_SINTR is
clear rather than requiring the callers of the sleepq API to notice
this edge case and call a non-_sig variant of sleepq_wait().
- Clarify the flags arguments to sleepq_add(), sleepq_signal() and
sleepq_broadcast() by creating an explicit submask for sleepq types.
Also, add an explicit SLEEPQ_MSLEEP type rather than a magic number of
0. Also, add a SLEEPQ_INTERRUPTIBLE flag for use with sleepq_add() and
move the setting of TDF_SINTR to sleepq_add() if this flag is set rather
than sleepq_catch_signals(). Note that it is the caller's responsibility
to ensure that sleepq_catch_signals() is called if and only if this flag
is passed to the preceding sleepq_add(). Note that this also removes a
sched_lock lock/unlock pair from sleepq_catch_signals(). It also ensures
that for an interruptible sleep, TDF_SINTR is always set when
TD_ON_SLEEPQ() is true.


# 133741 15-Aug-2004 jmg

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowledge of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using its filter ops.

Currently, it uses MTX_DUPOK even though it is not always safe to
acquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 133412 09-Aug-2004 julian

Clean up and add comments.
Reserve some bits in the flag words for upcoming work.
Note the unused bits so they can be found more easily.


# 132942 31-Jul-2004 julian

Specify the locking for some proc fields


# 132622 24-Jul-2004 rwatson

More comment frobbing: insert a missing comma, and spell permissions
as credential.


# 132621 24-Jul-2004 rwatson

Spelling fix in comment: s/though/thought/.


# 132266 16-Jul-2004 jhb

- Move TDF_OWEPREEMPT, TDF_OWEUPC, and TDF_USTATCLOCK over to td_pflags
since they are only accessed by curthread and thus do not need any
locking.
- Move pr_addr and pr_ticks out of struct uprof (which is per-process)
and directly into struct thread as td_profil_addr and td_profil_ticks
as these variables are really per-thread. (They are used to defer an
addupc_intr() that was too "hard" until ast()).


# 132087 13-Jul-2004 davidxu

Add code to support debugging threaded processes.

1. Add tm_lwpid into kse_thr_mailbox to indicate which kernel
thread the current user thread is running on. Add tm_dflags into
kse_thr_mailbox; these flags are written by the debugger and tell
the UTS and kernel what should be done when the process is being
debugged. Currently there are two flags, TMDF_SSTEP and
TMDF_DONOTRUNUSER.

TMDF_SSTEP is used to tell the kernel to turn on single stepping,
or to turn it off if the flag is not set.

TMDF_DONOTRUNUSER is used to tell the kernel to schedule an upcall
whenever possible; to the UTS, it means do not run the user thread
until the debugger clears it. This behaviour is necessary because
gdb wants to resume only one thread when that thread's pc is at a
breakpoint and the thread needs to go forward; in order to keep
other threads from sneaking past the breakpoint, it removes the
breakpoint and wants only that one thread to go. Also, add km_lwp
to kse_mailbox; the lwp id is copied to kse_thr_mailbox at context
switch time when the process is not being debugged, so when the
process is attached, the debugger can map kernel threads to user
threads.

2. Add p_xthread to the proc structure and td_xsig to the thread
structure. p_xthread is used by a thread when it wants to report an
event to the debugger; every thread can set the pointer, and notably,
when it is used in ptracestop, the last thread reporting an event
wins the race. Every thread has a td_xsig to exchange signals with
the debugger; a thread uses the TDF_XSIG flag to indicate it is
reporting a signal to the debugger, and if the flag is not cleared,
the thread will keep retrying until it is cleared by the debugger.
p_xthread may be used by the debugger to indicate the CURRENT
thread. The p_xstat is still in the proc structure to keep wait()
working; in the future, we may just use td_xsig.

3. Add the TDF_DBSUSPEND flag; it is used by the debugger to suspend
a thread. When the process stops, the debugger can set the flag for
a thread, and the thread will check the flag in thread_suspend_check
and enter a loop unless it is cleared by the debugger, the process
is detached, or the process is exiting. The flag is also checked in
ptracestop, so the debugger can temporarily suspend a thread even
if the thread wants to exchange a signal.

4. Currently, in ptrace, we always resume all threads, but if a thread
has the TDF_DBSUSPEND flag set by the debugger, it won't run.

Encouraged by: marcel, julian, deischen


# 132016 12-Jul-2004 marcel

Implement the PT_LWPINFO request. This request can be used by the
tracing process to obtain information about the LWP that caused the
traced process to stop. Debuggers can use this information to select
the thread currently running on the LWP as the current thread.

The request has been made as compatible with NetBSD as possible.
This implementation differs from NetBSD in the following ways:
1. The data argument is allowed to be smaller than the size of the
ptrace_lwpinfo structure known to the kernel, but not 0. This
is opposite to what NetBSD allows. The reason for this is that
we can extend the structure without affecting older binaries.
2. On NetBSD the tracing process is to set the pl_lwpid field to
the Id of the LWP it wants information of. We don't do that.
Our ptrace interface allows passing the LWP Id instead of the
PID. The tracing process is to set the PID to the LWP Id it
wants information of.
3. When the PID is actually the PID of the tracing process, this
request returns the information about the LWP that caused the
process to stop. This was the whole purpose of the request in
the first place.

When the traced process has exited, this request will return the
LWP Id 0, indicating that the process state is not the result of
an event specific to a LWP.
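
A userland sketch of the request as described, using the FreeBSD
ptrace(2) calling convention; error handling is elided:

    #include <sys/types.h>
    #include <sys/ptrace.h>
    #include <stdio.h>

    /*
     * After waitpid() reports a stop, ask which LWP caused it.  Passing
     * a size smaller than the kernel's struct is allowed, per point 1
     * above; a returned pl_lwpid of 0 means no specific LWP.
     */
    static void
    show_stopping_lwp(pid_t pid)
    {
            struct ptrace_lwpinfo pl;

            if (ptrace(PT_LWPINFO, pid, (caddr_t)&pl, sizeof(pl)) == 0)
                    printf("stopped by LWP %ld\n", (long)pl.pl_lwpid);
    }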


# 131481 02-Jul-2004 jhb

Implement preemption of kernel threads natively in the scheduler rather
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
determine if a thread about to be added to a run queue should be
preempted to directly. If it is not safe to preempt or if the new
thread does not have a high enough priority, then the function returns
false and sched_add() adds the thread to the run queue. If the thread
should be preempted to but the current thread is in a nested critical
section, then the flag TDF_OWEPREEMPT is set and the thread is added
to the run queue. Otherwise, mi_switch() is called immediately and the
thread is never added to the run queue since it is switched to directly.
When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
then clear it and call mi_switch() to perform the deferred preemption.
- Remove explicit preemption from ithread_schedule() as calling
setrunqueue() now does all the correct work. This also removes the
do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
chance to run if the architecture supports native preemption since
the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
preemption, namely alpha, i386, and amd64.

This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.

Approved by: scottl (with his re@ hat)
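
A sketch (locking elided) of the deferred-preemption path described
above; TDF_OWEPREEMPT is from the text, the rest is illustrative:

    static void
    critical_exit_sketch(struct thread *td)
    {
            if (--td->td_critnest == 0 &&
                (td->td_flags & TDF_OWEPREEMPT) != 0) {
                    td->td_flags &= ~TDF_OWEPREEMPT;
                    /* Perform the preemption we deferred while nested. */
                    mi_switch(SW_INVOL, NULL);
            }
    }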


# 131473 02-Jul-2004 jhb

- Change mi_switch() and sched_switch() to accept an optional thread to
switch to. If a non-NULL thread pointer is passed in, then the CPU will
switch to that thread directly rather than calling choosethread() to
pick a thread to switch to.
- Make sched_switch() aware of idle threads and know to do
TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
requiring all callers of mi_switch() to know to do this if they can be
called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
the middle of the function prototypes and up above into their own
section.


# 131149 26-Jun-2004 marcel

Allocate TIDs in thread_init() and deallocate them in thread_fini().
The overhead of unconditionally allocating TIDs (and likewise,
unconditionally deallocating them), is amortized across multiple
thread creations by the way UMA makes it possible to have type-stable
storage.
Previously the cost was kept down by having threads created as part
of a fork operation use the process' PID as the TID. While this had
some nice properties, it also introduced complexity in the way TIDs
were allocated. Most importantly, by using the type-stable storage
that UMA gives us this was also unnecessary.

This change affects how core dumps are created and in particular how
the PRSTATUS notes are dumped. Since we don't have a thread with a
TID equalling the PID, we now need a different way to preserve the
old and previous behavior. We do this by having the given thread (i.e.
the thread passed to the core dump code in td) dump its state first
and fill in pr_pid with the actual PID. All other threads will have
pr_pid contain their TIDs. The upshot of all this is that the debugger
will now likely select the right LWP (=TID) as the initial thread.

Credits to: julian@ for spotting how we can utilize UMA.
Thanks to: all who provided julian@ with test results.


# 130735 19-Jun-2004 marcel

Define __lwpid_t as an int32_t in <sys/_types.h> and define lwpid_t
as an __lwpid_t in <sys/types.h>. Retype td_tid from an int to a
lwpid_t and change related definitions accordingly.


# 130731 19-Jun-2004 bde

Include <sys/_lock.h>'s prerequisite <sys/queue.h> before including the
former, not after.

Don't hide this bug by including <sys/queue.h> in <sys/_lock.h>.


# 130551 15-Jun-2004 julian

Nice is a property of a process as a whole.
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.


# 130023 02-Jun-2004 tjr

Move TDF_DEADLKTREAT into td_pflags (and rename it accordingly) to avoid
having to acquire sched_lock when manipulating it in lockmgr(), uiomove(),
and uiomove_fromphys().

Reviewed by: jhb


# 129989 02-Jun-2004 tjr

Move TDF_SA from td_flags to td_pflags (and rename it accordingly)
so that it is no longer necessary to hold sched_lock while
manipulating it.

Reviewed by: davidxu


# 129750 26-May-2004 tmm

Retire cpu_sched_exit(); it is not used any more.


# 128721 28-Apr-2004 deischen

Keep track of threads waiting in kse_release() to avoid a race
condition where kse_wakeup() doesn't yet see them in (interruptible)
sleep queues. Also add an upcall check to sleepqueue_catch_signals()
suggested by jhb.

This commit should fix recent mysql hangs.

Reviewed by: jhb, davidxu
Mysql'd by: Robin P. Blanchard <robin.blanchard at gactr uga edu>


# 127976 07-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 127794 03-Apr-2004 marcel

Assign thread IDs to kernel threads. The purpose of the thread ID (tid)
is twofold:
1. When a 1:1 or M:N threaded process dumps core, we need to put the
register state of each of its kernel threads in the core file.
This can only be done by differentiating the pid field in the
respective note. For this we need the tid.
2. When thread support is present for remote debugging the kernel
with gdb(1), threads need to be identified by an integer due to
limitations in the remote protocol. This requires having a tid.

To minimize the impact of having thread IDs, threads that are created
as part of a fork (i.e. the initial thread in a process) will inherit
the process ID (i.e. tid=pid). Subsequent threads will have IDs larger
than PID_MAX to avoid interference with the pid allocation algorithm.
The assignment of tids is handled by thread_new_tid().

The thread ID allocation algorithm has been written with 3 assumptions
in mind:
1. IDs need to be created as fast as possible,
2. Reuse of IDs may happen instantaneously,
3. Someone else will write a better algorithm.


# 127696 31-Mar-2004 pjd

Remove sysctl kern.ps_argsopen, it is not very useful, one should use
security.bsd.see_other_uids instead.

Discussed with: phk, rwatson


# 127514 28-Mar-2004 bde

Fixed a style bug in the previous commit (tab lossage). Fixed some
nearby style bugs (more tab lossage, an unclear description of
TDF_USTATCLOCK, and English usage errors).


# 127482 27-Mar-2004 mtm

Separate thread synchronization from signals in libthr. Instead
use msleep() and wakeup_one().

Discussed with: jhb, peter, tjr


# 126326 27-Feb-2004 jhb

Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep queues in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleeps no longer inherently bump the priority. Instead, this is solely
a property of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.
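
A sketch of the hashed lookup described in the first item; the type
and macro names are illustrative:

    #include <sys/queue.h>
    #include <stdint.h>

    struct sleepqueue_sketch {
            LIST_ENTRY(sleepqueue_sketch) sq_hash;  /* hash chain link */
            void *sq_wchan;                         /* wait channel */
            /* ...the list of blocked threads hangs off here... */
    };
    LIST_HEAD(sqchain_sketch, sleepqueue_sketch);

    #define SC_TABLESIZE 128
    #define SC_HASH(wc)  ((((uintptr_t)(wc)) >> 8) % SC_TABLESIZE)

    static struct sqchain_sketch sleepq_chains_sketch[SC_TABLESIZE];

    static struct sleepqueue_sketch *
    sleepq_lookup_sketch(void *wchan)
    {
            struct sleepqueue_sketch *sq;

            /* One bucket probe finds the queue head; there is no need
             * to walk individual threads in the hash chain. */
            LIST_FOREACH(sq, &sleepq_chains_sketch[SC_HASH(wchan)], sq_hash)
                    if (sq->sq_wchan == wchan)
                            return (sq);
            return (NULL);
    }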


# 126318 27-Feb-2004 jhb

Fix a few style nits. do { } while(0) is only used for compound
statements, and nowhere else in the kernel seems to use it for single
statements. Also, all other users of do { } while(0) use multiple lines
rather than cramming it all onto one line.


# 125454 04-Feb-2004 jhb

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that it is always
copy-on-write, so having a reference to a structure is sufficient to
read from it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't use the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64
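
A sketch of reading a limit through the new accessors; the prototypes
are assumed from the text:

    /*
     * p_limit is copy-on-write, so the proc lock is only needed long
     * enough to read through the pointer consistently.
     */
    static rlim_t
    open_file_limit_sketch(struct proc *p)
    {
            rlim_t cur;

            PROC_LOCK(p);
            cur = lim_cur(p, RLIMIT_NOFILE);   /* current (soft) limit */
            PROC_UNLOCK(p);
            return (cur);
    }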


# 125111 27-Jan-2004 rwatson

Allow the use of a stale p_stops value in STOPEVENT(), grabbing
the proc lock only if we actually need to perform a stop. This
avoids two locks and unlocks of the process lock each system call,
and wins me about 20% on a simple system call test (getuid(),
which would otherwise require no locking). This also has a net
improvement of about 10MB/s on some of the SMP bandwidth tests
I'm running.

Reviewed by: jhb


# 124944 25-Jan-2004 jeff

- Add a flags parameter to mi_switch. The value of flags may be SW_VOL or
SW_INVOL. Assert that one of these is set in mi_switch() and properly
adjust the rusage statistics. This is to simplify the large number of
users of this interface which were previously all required to adjust the
proper counter prior to calling mi_switch(). This also facilitates more
switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate parameter and
remove direct references to the process statistics.
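
A sketch of a caller after this change; the single-argument form
matches this entry (the optional "new thread" argument described in
r131473 above came later), and the surrounding locking is illustrative:

    static void
    yield_sketch(void)
    {
            mtx_lock_spin(&sched_lock);
            /* mi_switch() itself now bumps ru_nvcsw for SW_VOL. */
            mi_switch(SW_VOL);
            mtx_unlock_spin(&sched_lock);
    }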


# 124829 22-Jan-2004 grehan

Make proc's kg_nice/ki_nice explicitly signed for PPC. This is a
no-op on {i386/alpha/ia64/sparc64} where chars are signed by
default. Should help ARM and S390 which also suffer from this.

Tested on: ppc, i386, objdump disasm before/after diffs
Reviewed by: obrien, bde (a while back)


# 124092 03-Jan-2004 davidxu

Make sigaltstack per-thread, because per-process sigaltstack state
is useless for threaded programs; multiple threads cannot share the
same stack.
The alternative signal stack is private to the thread, so no lock is
needed; the original P_ALTSTACK is now moved into td_pflags and
renamed to TDP_ALTSTACK.
For single-threaded or Linux clone() based threaded programs, there is
no semantic change, because those programs only have one kernel thread
in every process.

Reviewed by: deischen, dfr


# 123704 21-Dec-2003 jeff

- Cleanup some garbage left by KSE. There is still much garbage left to be
removed, see the 110 instances of "XXXKSE" in src/sys for examples.


# 122769 15-Nov-2003 imp

Remove the WITNESS debug code from _STOPEVENT. It has pointed out a
problem, but the spamage of consoles is really bad. Until we can get
this to be less chatty, disable it so people can boot. The badness of
the spamage is worse than the badness that it reports :-(. Once the
underlying problems have been fixed, it can be reenabled.

Approved by: kken, markm, rwatson, grog, murray


# 122733 15-Nov-2003 bde

Fixed some bugs in macros:
- missing parenthesization of some macro args
- point of do-while(0) hack defeated by putting a semicolon after while(0)

Fixed some style bugs in macros:
- not splitting the line when the macro value cannot be lined up in much
the same macros that didn't parenthesize their args
- braces around a 1-line statement
- do-while(0) hack not indented in the usual way in the same macros that
defeated its point.


# 122673 14-Nov-2003 jhb

Add a WITNESS_WARN() check to _STOPEVENT() since any _STOPEVENT() can
potentially sleep. This should help find some bogus locking.


# 122524 12-Nov-2003 rwatson

Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) due to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 122514 11-Nov-2003 jhb

Add an implementation of turnstiles and change the sleep mutex code to use
turnstiles to implement blocking instead of implementing a thread queue
directly. These turnstiles are somewhat similar to those used in Solaris 7
as described in Solaris Internals but are also different.

Turnstiles do not come out of a fixed-sized pool. Rather, each thread is
assigned a turnstile when it is created that it frees when it is destroyed.
When a thread blocks on a lock, it donates its turnstile to that lock to
serve as queue of blocked threads. The queue associated with a given lock
is found by a lookup in a simple hash table. The turnstile itself is
protected by a lock associated with its entry in the hash table. This
means that sched_lock is no longer needed to contest on a mutex. Instead,
sched_lock is only used when manipulating run queues or thread priorities.
Turnstiles also implement priority propagation inherently.

Currently turnstiles only support mutexes. Eventually, however, turnstiles
may grow two queues to support a non-sleepable reader/writer lock
implementation. For more details, see the comments in sys/turnstile.h and
kern/subr_turnstile.c.

The two primary advantages from the turnstile code include: 1) the size
of struct mutex shrinks by four pointers as it no longer stores the
thread queue linkages directly, and 2) less contention on sched_lock in
SMP systems including the ability for multiple CPUs to contend on different
locks simultaneously (not that this last detail is necessarily that much of
a big win). Note that 1) means that this commit is a kernel ABI breaker,
so don't mix old modules with a new kernel and vice versa.

Tested on: i386 SMP, sparc64 SMP, alpha SMP


# 122171 06-Nov-2003 bde

Fixed some more missing punctuation in comments (most instances in this
file except for about 30 lines that have more errors and/or need rewording
to fit the punctuation).


# 122167 06-Nov-2003 bde

Fixed some style bugs (missing punctuation in comments). There are many
more of these in proc.h alone.


# 122157 06-Nov-2003 jeff

- Add a pinned count to the thread so that cpu pinning may nest. This is
not in scheduler specific data because eventually it will be required by
all schedulers.
- Implement sched_pin and unpin as an inline for now. If a scheduler needs
to do something more complicated than adjusting the pinned count we can
move this into a function later in an api compatible way.
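
A sketch of the inlines as described: pinning is just a per-thread
nesting count, cheap enough to live in a header (curthread and
td_pinned are the kernel names this entry introduces):

    static __inline void
    sched_pin(void)
    {
            curthread->td_pinned++;    /* stay on this CPU while > 0 */
    }

    static __inline void
    sched_unpin(void)
    {
            curthread->td_pinned--;
    }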


# 121789 31-Oct-2003 jeff

- Add 4 kse flags for use in the schedulers.


# 121443 23-Oct-2003 jhb

Move the P_COWINPROGRESS flag from being a per-process p_flag to being a
per-thread td_pflag which doesn't require any locks to read or write as it
is only read or written by curthread on itself.

Glanced at by: mckusick


# 121228 18-Oct-2003 njl

Add the cpu_idle_hook() function pointer so that other idlers can be
hooked at runtime. Make C1 sleep (e.g., HLT) be the default. This
prepares the way for further ACPI sleep states.
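
A sketch of hooking the idler at runtime; the hook is assumed to be a
plain void function pointer, and the ACPI routine named is
hypothetical:

    extern void (*cpu_idle_hook)(void);

    static void
    acpi_cx_idle_sketch(void)
    {
            /* ...enter a deeper ACPI sleep state than the default C1... */
    }

    static void
    register_idler_sketch(void)
    {
            cpu_idle_hook = acpi_cx_idle_sketch;
    }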


# 120807 05-Oct-2003 bms

Mark td_generation as volatile.


# 120802 05-Oct-2003 bms

Add a pre-emption counter, td_generation, so that threads can notice
when they have been pre-empted by other threads. This is bumped from
within mi_switch() every time a context switch takes place.

Discussed with: pete
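
A sketch of the intended use: snapshot the counter, do the work, and
notice whether mi_switch() ran in between; the surrounding code is
illustrative:

    static void
    work_sketch(struct thread *td)
    {
            int gen;

            gen = td->td_generation;
            /* ...compute something from per-CPU or cached state... */
            if (td->td_generation != gen) {
                    /* Preempted at least once; revalidate cached state. */
            }
    }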


# 120602 30-Sep-2003 jeff

- On my Pentium4-M laptop, invalpg takes ~1100 cycles if the page is found
in the TLB and ~1600 if it is not. Therefore, it is more efficient to
invalidate the TLB after operations that use CMAP rather than before.
- So that the TLB is invalidated prior to switching off of a processor, we
must change the switchin functions to switchout functions.
- Remove td_switchout from the thread and move it to the x86 pcb.
- Move the code that calls switchout into swtch.s. These changes make this
optimization truly x86 specific.


# 119004 16-Aug-2003 marcel

In vm_thread_swap{in|out}(), remove the alpha specific conditional
compilation and replace it with a call to cpu_thread_swap{in|out}().
This allows us to add similar code on ia64 without cluttering the
code even more.


# 118893 14-Aug-2003 grehan

Update powerpc to use the (old thread,new thread) calling convention
for cpu_throw() and cpu_switch().


# 118835 12-Aug-2003 jhb

- Convert Alpha over to the new calling conventions for cpu_throw() and
cpu_switch() where both the old and new threads are passed in as
arguments. Only powerpc uses the old conventions now.
- Update comments in the Alpha swtch.s to reflect KSE changes.

Tested by: obrien, marcel


# 118442 04-Aug-2003 jhb

Set td_critnest to 1 when setting up a thread since it is a MI field with
MI values. This ensures that td_critnest for a newly fork'd thread is
always valid.

Requested by: bde (a long time ago)


# 117730 18-Jul-2003 tjr

Remove extern declaration of ps_showallprocs. The definition is already gone.

PR: 54604
Submitted by: Pawel Jakub Dawidek


# 117704 17-Jul-2003 davidxu

o Refine kse_thr_interrupt to allow it to handle different commands.
o Remove TDF_NOSIGPOST.
o Add a member td_waitset to the thread structure; it will be used for
sigwait.

Tested by: deischen


# 117685 17-Jul-2003 mtm

Fix umtx locking, for libthr, in the kernel.
1. There was a race condition between a thread unlocking
a umtx and the thread contesting it. If the unlocking
thread won the race it may try to wakeup a thread that
was not yet in msleep(). The contesting thread would then
go to sleep to await a wakeup that would never come. It's
not possible to close the race by using a lock because
calls to casuptr() may have to fault a page in from swap.
Instead, the race was closed by introducing a flag that
the unlocking thread will set when waking up a thread.
The contesting thread will check for this flag before
going to sleep. For now the flag is kept in td_flags,
but it may be better to use some other member or create
a new one because of the possible performance/contention
issues of having to own sched_lock. Thanks to jhb for
pointing me in the right direction on this one.

2. Once a umtx was contested all future locks and unlocks
were happening in the kernel, regardless of whether it
was contested or not. To prevent this from happening,
when a thread locks a umtx it checks the queue for that
umtx and unsets the contested bit if there are no other
threads waiting on it. Again, this is slightly more
complicated than it needs to be because we can't hold
a lock across casuptr(). So, the thread has to check
the queue again after unsetting the bit, and reset the
contested bit if it finds that another thread has put
itself on the queue in the meantime.

3. Remove the if... block for unlocking an uncontested
umtx, and replace it with a KASSERT. The _only_ time
a thread should be unlocking a umtx in the kernel is
if it is contested.


# 117600 14-Jul-2003 davidxu

Rename thread_siginfo to cpu_thread_siginfo.

Suggested by: jhb


# 116963 28-Jun-2003 davidxu

o Change kse_thr_interrupt to allow sending a signal to a specified
thread, or unblocking a thread in the kernel, and allow the UTS to
specify whether the syscall should be restarted.
o Add the ability for the UTS to monitor signals coming in and being
removed from the process; the flag PS_SIGEVENT is used to indicate
the events.
o Add a KMF_WAITSIGEVENT KSE mailbox flag; the UTS calls kse_release
with this flag set to wait for the above signal events.
o For SA based threads, the kernel masks all signals in the thread's
signal mask, letting the UTS use kse_thr_interrupt to interrupt a
thread and install a signal frame in userland for it.
o Add a tm_syncsig in the thread mailbox; when a hardware trap occurs,
it is used to deliver a synchronous signal to userland, and an upcall
is scheduled so the UTS can process the synchronous signal for the
thread.

Reviewed by: julian (mentor)


# 116372 15-Jun-2003 davidxu

1. Migrate TDF_UPCALLING from td_flags to td_pflags.
2. Add a flag TDF_SA; it will be used to distinguish SA-based
threads from bound threads.


# 116361 14-Jun-2003 davidxu

Rename P_THREADED to P_SA. P_SA means a process is using scheduler
activations.


# 116228 11-Jun-2003 davidxu

Reorder P_* flags.


# 116188 11-Jun-2003 peter

GC unused cpu_wait() function


# 116101 09-Jun-2003 jhb

- Add a td_pflags field to struct thread for private flags accessed only by
curthread. Unlike td_flags, this field does not need any locking.
- Replace the td_inktr and td_inktrace variables with equivalent private
thread flags.
- Move TDF_OLDMASK over to the private flags field so it no longer requires
sched_lock.
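
A sketch of why no locking is needed: only curthread reads or writes
its own td_pflags (TDP_INKTRACE stands in for the renamed td_inktr
flag and is an assumption here):

    static void
    ktrace_enter_sketch(struct thread *td)
    {
            td->td_pflags |= TDP_INKTRACE;   /* curthread only: no lock */
    }

    static void
    ktrace_exit_sketch(struct thread *td)
    {
            td->td_pflags &= ~TDP_INKTRACE;
    }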


# 115858 04-Jun-2003 marcel

Change the second (and last) argument of cpu_set_upcall(). Previously
we were passing in a void* representing the PCB of the parent thread.
Now we pass a pointer to the parent thread itself.
The prime reason for this change is to allow cpu_set_upcall() to copy
(parts of) the trapframe instead of having it done in MI code in each
caller of cpu_set_upcall(). Copying the trapframe cannot always be
done with a simple bcopy() or may not always be optimal that way. On
ia64 specifically the trapframe contains information that is specific
to an entry into the kernel and can only be used by the corresponding
exit from the kernel. A trapframe copied verbatim from another frame
is in most cases useless without some additional normalization.

Note that this change removes the assignment to td->td_frame in some
implementations of cpu_set_upcall(). The assignment is redundant.
A previous call to cpu_thread_setup() already did the exact same
assignment. An added benefit of removing the redundant assignment is
that we can now change td_pcb without nasty side-effects.

This change officially marks the ability on ia64 for 1:1 threading.

Not tested on: amd64, powerpc
Compile & boot tested on: alpha, sparc64
Functionally tested on: i386, ia64


# 115790 03-Jun-2003 julian

Remove un-needed code.
Don't copyin() data we are about to overwrite.
Add a flag to tell userland that KSE is officially "DONE" with the
mailbox and has gone away.

Obtained from: davidxu@


# 115765 03-Jun-2003 jeff

- Remove the blocked pointer from the umtx structure.
- Use a hash of umtx queues to queue blocked threads. We hash on pid and the
virtual address of the umtx structure. This eliminates cases where we
previously held a lock across a casuptr call.

Reviewed by: jhb (quickly)


# 115702 02-Jun-2003 tegge

Add tracking of process leaders sharing a file descriptor table and
allow a file descriptor table to be shared between multiple process
leaders.

PR: 50923


# 115084 16-May-2003 marcel

Revamp of the syscall path, exception and context handling. The
prime objectives are:
o Implement a syscall path based on the epc instruction (see
sys/ia64/ia64/syscall.s).
o Revisit the places where we need to save and restore registers
and define those contexts in terms of the register sets (see
sys/ia64/include/_regset.h).

Secondary objectives:
o Remove the requirement to use contigmalloc for kernel stacks.
o Better handling of the high FP registers for SMP systems.
o Switch to the new cpu_switch() and cpu_throw() semantics.
o Add a good unwinder to reconstruct contexts for the rare
cases we need to (see sys/contrib/ia64/libuwx)

Many files are affected by this change. Functionally it boils
down to:
o The EPC syscall doesn't preserve registers it does not need
to preserve and places the arguments differently on the stack.
This affects libc and truss.
o The address of the kernel page directory (kptdir) had to
be unstaticized for use by the nested TLB fault handler.
The name has been changed to ia64_kptdir to avoid conflicts.
The renaming affects libkvm.
o The trapframe only contains the special registers and the
scratch registers. For syscalls using the EPC syscall path
no scratch registers are saved. This affects all places where
the trapframe is accessed. Most notably the unaligned access
handler, the signal delivery code and the debugger.
o Context switching only partly saves the special registers
and the preserved registers. This affects cpu_switch() and
triggered the move to the new semantics, which additionally
affects cpu_throw().
o The high FP registers are either in the PCB or on some
CPU. context switching for them is done lazily. This affects
trap().
o The mcontext has room for all registers, but not all of them
have to be defined in all cases. This mostly affects signal
delivery code now. The *context syscalls are as of yet still
unimplemented.

Many details went into the removal of the requirement to use
contigmalloc for kernel stacks. The details are mostly CPU
specific and limited to exception_save() and exception_restore().
The few places where we create, destroy or switch stacks were
mostly simplified by not having to construct physical addresses
and additionally saving the virtual addresses for later use.

Besides more efficient context saving and restoring, which of
course yields a noticeable speedup, this also fixes the dreaded
SMP bootup problem as a side-effect. The details of which are
still not fully understood.

This change includes all the necessary backward compatibility
code to have it handle older userland binaries that use the
break instruction for syscalls. Support for break-based syscalls
has been pessimized in favor of a clean implementation. Due to
the overall better performance of the kernel, this will still
be noticed as an improvement if it's noticed at all.

Approved by: re@ (jhb)


# 114983 13-May-2003 jhb

- Merge struct procsig with struct sigacts.
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.

Reviewed by: arch@
Approved by: re (rwatson)


# 114471 01-May-2003 julian

Move the flag that indicates an idle thread from the KSE to the thread.
It was always referenced via the thread anyhow.

Reviewed by: jhb (a LOOOOONG time ago)


# 114462 01-May-2003 jhb

Small style tweaks to some members of struct session and sigio to be more
consistent with other structures in this file.


# 114435 01-May-2003 jhb

Garbage collect unused TDF_INMSLEEP flag.


# 114336 30-Apr-2003 peter

AMD64 uses the new-style cpu_switch()/cpu_throw() calling conventions.


# 113973 24-Apr-2003 wes

Make P_PROTECTED not conflict with P_STOPPED_SIG. Replace
P_UNUSED100000 which is *truly* unused, until now.

Submitted by: Robert Drehmel <robert@zoot.drehmel.com>


# 113925 23-Apr-2003 jhb

Update many of the locking notes and comments for struct
thread/kse/ksegroup/proc.


# 113874 22-Apr-2003 jhb

- Move PS_PROFIL and its new cousin PS_STOPPROF back over to p_flag and
rename them appropriately. Protect both flags with both the proc lock
and the sched_lock.
- Protect p_profthreads with the proc lock.
- Remove Giant from profil(2).


# 113867 22-Apr-2003 jhb

- Always call faultin() in _PHOLD() if PS_INMEM is clear. This closes a
race where a thread could assume that a process was swapped in by
PHOLD() when it actually wasn't fully swapped in yet.
- In faultin(), always msleep() if PS_SWAPPINGIN is set instead of doing
this check after bumping p_lock in the PS_INMEM == 0 case. Also,
sched_lock is only needed for setting and clearing swapping PS_*
flags and the swap thread inhibitor.
- Don't set and clear the thread swap inhibitor in the same loops as the
pmap_swapin/out_thread() since we have to do it under sched_lock.
Instead, mimic the treatment of the PS_INMEM flag and use separate loops
to set the inhibitors when clearing PS_INMEM and clear the inhibitors
when setting PS_INMEM.
- swapout() now returns with the proc lock held as it holds the lock
while adjusting the swapping-related PS_* flags so that the proc lock
can be used to test those flags.
- Only use the proc lock to check the swapping-related PS_* flags in
several places.
- faultin() no longer requires sched_lock to be held by callers.
- Rename PS_SWAPPING to PS_SWAPPINGOUT to be less ambiguous now that we
have PS_SWAPPINGIN.


# 113792 21-Apr-2003 davidxu

Add a member field for kse_upcall to cache kse mailbox flags.
Code for this will be committed soon.


# 113696 18-Apr-2003 julian

Back out last commit.


# 113690 18-Apr-2003 jhb

- Make sigonstack() a regular function instead of an inline and add a proc
lock assertion to it.
- SIGPENDING() no longer needs sched_lock, so only grab sched_lock to set
the TDF_NEEDSIGCHK and TDF_ASTPENDING flags in signotify().
- Add a proc lock assertion to tdsigwakeup().
- Since we always set TDF_OLDMASK while holding the proc lock, the proc
lock is sufficient protection to check its state in postsig() and we only
need sched_lock when clearing the actual flag.


# 113678 18-Apr-2003 julian

Revert parts of 1.309 to allow processes to have a signal mask
independently from the threads again.
Will be adding code to use this soon..


# 113641 17-Apr-2003 julian

Add a thread_unlink() and use it.
It could also be used twice in kern_thr.c but that's owned by jeff
so I'll let him change it when he's next there.


# 113627 17-Apr-2003 jhb

Adjust a few comments.


# 113450 13-Apr-2003 jake

Made vmspace0 non-static. It's useful to be able to identify a vmspace as
the kernel vmspace.


# 113339 10-Apr-2003 julian

Move the _oncpu entry from the KSE to the thread.
The entry in the KSE still exists but its purpose will change a bit
when we add the ability to lock a KSE to a cpu.


# 112993 02-Apr-2003 peter

Commit a partial lazy thread switch mechanism for i386. it isn't as lazy
as it could be and can do with some more cleanup. Currently its under
options LAZY_SWITCH. What this does is avoid %cr3 reloads for short
context switches that do not involve another user process. ie: we can
take an interrupt, switch to a kthread and return to the user without
explicitly flushing the tlb. However, this isn't as exciting as it could
be, the interrupt overhead is still high and too much blocks on Giant
still. There are some debug sysctls, for stats and for an on/off switch.

The main problem with doing this has been "what if the process that you're
running on exits while we're borrowing its address space?" - in this case
we use an IPI to give it a kick when we're about to reclaim the pmap.

Its not compiled in unless you add the LAZY_SWITCH option. I want to fix a
few more things and get some more feedback before turning it on by default.

This is NOT a replacement for Bosko's lazy interrupt stuff. This was more
meant for the kthread case, while his was for interrupts. Mine helps a
little for interrupts, but his helps a lot more.

The stats are enabled with options SWTCH_OPTIM_STATS - this has been a
pseudo-option for years, I just added a bunch of stuff to it.

One non-trivial change was to select a new thread before calling
cpu_switch() in the first place. This allows us to catch the silly
case of doing a cpu_switch() to the current process. This happens
uncomfortably often. This simplifies a bit of the asm code in cpu_switch
(no longer have to call choosethread() in the middle). This has been
implemented on i386 and (thanks to jake) sparc64. The others will come
soon. This is actually separate from the lazy switch stuff.

Glanced at by: jake, jhb


# 112905 31-Mar-2003 jeff

- Add an entry and a head for the queue of threads blocked on a umtx.
- Add a prototype for thr_exit1().


# 112889 31-Mar-2003 jeff

- Move the NEEDSIGCHK and OLDMASK flags from proc to thread.
- Move the signal mask to the thread.
- Adjust a few comments.


# 112881 31-Mar-2003 wes

Add a facility allowing processes to inform the VM subsystem they are
critical and should not be killed when pageout is looking for more
memory pages in all the wrong places.

Reviewed by: arch@
Sponsored by: St. Bernard Software


# 112704 27-Mar-2003 maxim

o Fix a comment.
o GC an unused macro.

PR: kern/49083
Submitted by: Bjoern A. Zeeb <bzeeb+freebsd@zabbadoz.net>
Not objected by: rwatson


# 112397 19-Mar-2003 davidxu

Adjust code for userland preemption. Userland can set a quantum in
kse_mailbox to schedule an upcall; this is useful for userland timeout
routines, for example pthread_cond_timedwait().

Also extract the upcall scheduling code from kse_reassign and create
a new function called thread_switchout to hold this code.

Reviewed by: julian


# 112389 18-Mar-2003 des

Whitespace cleanup.


# 112198 13-Mar-2003 jhb

- Cache a reference to the credential of the thread that starts a ktrace in
struct proc as p_tracecred alongside the current cache of the vnode in
p_tracep. This credential is then used for all later ktrace operations on
this file rather than using the credential of the current thread at the
time of each ktrace event.
- Now that we have multiple ktrace-related items in struct proc that are
pointers, rename p_tracep to p_tracevp to make it less ambiguous.

Requested by: rwatson (1)
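
A sketch of the caching described above (fragment; assumes the caller
already holds a reference on vp and is running as td):

/* At ktrace(2) start: cache the credential next to the vnode. */
PROC_LOCK(p);
p->p_traceflag = facs;
crhold(td->td_ucred);                   /* reference for the cache */
p->p_tracecred = td->td_ucred;          /* used for all later events */
p->p_tracevp = vp;                      /* renamed from p_tracep */
PROC_UNLOCK(p);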


# 112079 11-Mar-2003 davidxu

This is a force-commit for:
kern_sig.c 1.215
kern_thread.c 1.103
kern_exit.c 1.199
proc.h 1.302

Original code would suspend an already suspended thread if the
user presses ^Z while a threaded program is running. There was also
a race between job control and thread_exit(): the new code tests for
a job control request before a thread exits. In the wait() syscall,
be sure to check that the child process is fully stopped; this avoids
a later SIGCHLD and returning STOPPED status twice for a threaded
child proc. A thread_stopped() function is added for the common code
in several places.


# 112071 10-Mar-2003 davidxu

Fix threaded process job control bug. SMP tested.

Reviewed by: julian


# 111585 27-Feb-2003 julian

Change the process flags P_KSES to be P_THREADED.
This is just a cosmetic change but I've been meaning to do it for about a year.


# 111122 19-Feb-2003 davidxu

Eliminate unused KSE symbols.


# 111115 19-Feb-2003 davidxu

Optimize the case when max threads number was hit.


# 111034 17-Feb-2003 tjr

Use the proc lock to protect p_realtimer instead of Giant, and obtain
sched_lock around accesses to p_stats->p_timer[] to avoid a potential
race with hardclock. getitimer(), setitimer() and the realitexpire()
callout are now Giant-free.


# 111033 17-Feb-2003 jeff

- Add a new function, thread_signal_add(), that is called from postsig to
add a signal to a mailbox's pending set.
- Add a new function, thread_signal_upcall(), this causes the current thread
to upcall so that we can deliver pending signals.

Reviewed by: mini


# 111032 17-Feb-2003 julian

Move a bunch of flags from the KSE to the thread.
I was in two minds as to where to put them in the first case..
I should have listened to the other mind.

Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@


# 111028 17-Feb-2003 jeff

- Split the struct kse into struct upcall and struct kse. struct kse will
soon be visible only to schedulers. This greatly simplifies much of the
KSE code.

Submitted by: davidxu


# 111024 17-Feb-2003 jeff

- Move ke_sticks, ke_iticks, ke_uticks, ke_uu, ke_su, and ke_iu back into
the proc. These counters are only examined through calcru.

Submitted by: davidxu
Tested on: x86, alpha, UP/SMP


# 110580 09-Feb-2003 julian

Fix spelling errors in comments

Submitted by: Davidxu@


# 110530 08-Feb-2003 julian

A little infrastructure, preceding some upcoming changes
to the profiling and statistics code.

Submitted by: DavidXu@
Reviewed by: peter@


# 110190 01-Feb-2003 julian

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 109877 26-Jan-2003 davidxu

Move the UPCALL-related data structure out of kse and introduce a new
data structure called kse_upcall to manage UPCALLs. All KSE binding
and loaning code is gone.

A thread that owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and take those contexts back
to userland. Any thread without an upcall structure has to export its
context and exit at the user boundary.

Any thread running in user mode owns an upcall structure. When it enters
the kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in the kernel, a new UPCALL thread is created and
the upcall structure is transferred to the new UPCALL thread. If the kse
mailbox's current thread pointer is NULL, then when a thread is blocked
in the kernel, no UPCALL thread will be created.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit; when all upcalls in a ksegrp are removed, the group is
automatically shut down. An upcall owner thread also exits when the process
is in the exiting state. When an owner thread exits, the upcall it owns is
also removed.

KSE is a pure scheduler entity. It represents a virtual cpu. When a thread
is running, it always has a KSE associated with it. The scheduler is free to
assign a KSE to a thread according to thread priority; if the thread priority
is changed, a KSE can be moved from one thread to another.

When a ksegrp is created, there are always N KSEs created in the group,
where N is the number of physical cpus in the current system. This makes it
possible that, even if a userland UTS is only single-CPU safe, threads in the
kernel can still execute on different cpus in parallel. Userland calls
kse_create to add more upcall structures to the ksegrp to increase concurrency
in userland itself; the kernel is not restricted by the number of upcalls
userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian


# 109157 13-Jan-2003 jeff

- Unbreak world. I did not notice that libkvm was still used in some places
to access the pctcpu. This will have to be sorted out more later as the
new scheduler requires a procedural interface for this data. A more
complete solution will follow.


# 109145 12-Jan-2003 jeff

- Move ke_pctcpu and ke_cpticks into the scheduler specific datastructure.
This will prevent access through mechanisms other than the published
interfaces.


# 108613 03-Jan-2003 julian

Make an explicit flag to indicate that a KSE has a reason to upcall,
and use that flag when there is a kse_wakeup() call. It will probably
be used with signal delivery as well eventually.

Submitted by: davidxu@


# 108524 31-Dec-2002 alfred

When compiling the kernel do not implicitly include filedesc.h from proc.h,
this was causing filedesc work to be very painful.
In order to make this work, split out the sigio definitions to their own
header (sigio.h), which is included from proc.h for the time being.


# 108338 27-Dec-2002 julian

Add code to ddb to allow backtracing an arbitrary thread.
(show thread {address})

Remove the IDLE kse state and replace it with a change in
the way threads share KSEs. Every KSE now has a thread, which is
considered its "owner"; however, a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
In this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.

All creation of upcalls etc. is now done from
kse_reassign() which in turn is called from mi_switch or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().

kse_release() does not leave a KSE with no thread any more but
converts the existing thread into the KSE's owner, and sets it up
for doing an upcall. It is just inhibited from being scheduled until
there is some reason to do an upcall.

Remove all trace of the kse_idle queue since it is no longer needed.
"Idle" KSEs are now on the loanable queue.


# 107719 10-Dec-2002 julian

Unbreak the KSE code. Keep track of zombie threads using the per-CPU storage
during the context switch. Rearrange thread cleanups
to avoid problems with Giant. Clean threads when freed or
when recycled.

Approved by: re (jhb)


# 107180 22-Nov-2002 mux

Under certain circumstances, we were calling kmem_free() from
i386 cpu_thread_exit(). This resulted in a panic with WITNESS
since we need to hold Giant to call kmem_free(), and we weren't
holding it anymore in cpu_thread_exit(). We now do this from a
new MD function, cpu_thread_dtor(), called by thread_dtor().

Approved by: re@
Suggested by: jhb


# 107135 21-Nov-2002 jeff

- Move scheduler specific macros and defines out of proc.h

Approved by: re


# 107126 20-Nov-2002 jeff

- Implement a mechanism for allowing schedulers to place scheduler-dependent
data in the scheduler-independent structures (proc, ksegrp, kse, thread).
- Implement unused stubs for this mechanism in sched_4bsd.

Approved by: re
Reviewed by: luigi, trb
Tested on: x86, alpha


# 107105 20-Nov-2002 rwatson

Introduce p_label, extensible security label storage for the MAC framework
in struct proc. While the process label is actually stored in the
struct ucred pointed to by p_ucred, there is a need for transient
storage that may be used when asynchronous (deferred) updates need to
be performed on the "real" label for locking reasons. Unlike other
label storage, this label has no locking semantics, relying on policies
to provide their own protection for the label contents, meaning that
a policy leaf mutex may be used, avoiding lock order issues. This
permits policies that act based on historical process behavior (such
as audit policies, the MAC Framework port of LOMAC, etc.) to update
process properties even when many existing locks are held, without
violating the lock order. No currently committed policies implement use
of this label storage.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 107034 17-Nov-2002 davidxu

1. Support versioning and wall clock in kse mailbox,
also add rusage time in thread mailbox.
2. Minor change to the thread limit code in thread_user_enter();
fix a typo in kse_release() from my last commit.

Reviewed by: deischen, mini


# 106655 08-Nov-2002 rwatson

To reduce per-return overhead of userret(), call into
mac_thread_userret() only if PS_MACPEND is set in the process AST mask.
This avoids the cost of the entry point in the common case, but
requires policies interested in the userret event to set the flag
(protected by the scheduler lock) if they do want the event. Since
all the policies that we're working with which use mac_thread_userret()
use the entry point only selectively to perform operations deferred
for locking reasons, this maintains the desired semantics.

Approved by: re
Requested by: bde
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 106180 30-Oct-2002 davidxu

Add an actual implementation of kse_thr_interrupt()


# 105956 25-Oct-2002 jhb

Don't copy td_md. Instead, let the MD code handle it just like it
handles the MD fields of all the other MD portions of proc-related
structures.

Tested on: i386, alpha, sparc64


# 105900 24-Oct-2002 julian

Extract out KSE specific code from machine specific code
so that there is only one copy of it. Fix that one copy
so that KSEs with no mailbox in a KSE program are not a cause
of page faults (this can legitimately happen).

Submitted by: (parts) davidxu


# 105854 24-Oct-2002 julian

Move thread related code from kern_proc.c to kern_thread.c.
Add code to free KSEs and KSEGRPs on exit.
Sort KSE prototypes in proc.h.
Add the missing kse_exit() syscall.

ksetest now does not leak KSEs and KSEGRPS.

Submitted by: (parts) davidxu


# 105663 21-Oct-2002 julian

Remove the process state PRS_WAIT.
It is never used. I left it there from pre-KSE days as I didn't know
if I'd need it or not, but now I know I don't. Its functionality
is in TDI_IWAIT in the thread.


# 105641 21-Oct-2002 julian

Add a flag needed for recovery of excess allocated KSEs.
(not used in non KSE processes).

Submitted by: davidxu


# 105141 14-Oct-2002 jhb

- Add a new global mutex 'ppeers_lock' to protect the p_peers list of
processes forked with RFTHREAD.
- Use a goto to a label for common code when exiting from fork1() in case
of an error.
- Move the RFTHREAD linkage setup code later in fork since the ppeers_lock
cannot be locked while holding a proc lock. Handle the race of a task
leader exiting and killing its peers while a peer is forking a new child.
In that case, go ahead and let the peer process proceed normally as the
parent is about to kill it. However, the task leader may have already
gone to sleep to wait for the peers to die, so the new child process may
not receive a SIGKILL from the task leader. Rather than try to destruct
the new child process, just go ahead and send it a SIGKILL directly and
add it to the p_peers list. This ensures that the task leader will wait
until both the peer process doing the fork() and the new child process
have received their KILL signals and exited.

Discussed with: truckman (earlier versions)
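
A sketch of the fork-side race handling described above (illustrative,
not the committed diff; ppeers_lock is taken before any proc lock):

mtx_lock(&ppeers_lock);
p2->p_peers = p1->p_peers;
p1->p_peers = p2;
p2->p_leader = p1->p_leader;
if (p2->p_leader->p_flag & P_WEXIT) {
        /*
         * The leader is already killing its peers and may be asleep
         * waiting for them; it will never signal this new child, so
         * deliver the SIGKILL here instead.
         */
        PROC_LOCK(p2);
        psignal(p2, SIGKILL);
        PROC_UNLOCK(p2);
}
mtx_unlock(&ppeers_lock);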


# 105127 14-Oct-2002 julian

Tidy up the scheduler's code for changing the priority of a thread.
Logically pretty much a NOP.


# 104964 12-Oct-2002 jeff

- Create a new scheduler api that is defined in sys/sched.h
- Begin moving scheduler specific functionality into sched_4bsd.c
- Replace direct manipulation of scheduler data with hooks provided by the
new api.
- Remove KSE specific state modifications and single runq assumptions from
kern_switch.c

Reviewed by: -arch


# 104719 09-Oct-2002 jhb

- Move p_cpulimit to struct proc from struct plimit and protect it with
sched_lock. This means that we no longer access p_limit in mi_switch()
and the p_limit pointer can be protected by the proc lock.
- Remove PRS_ZOMBIE check from CPU limit test in mi_switch(). PRS_ZOMBIE
processes don't call mi_switch(), and even if they did there is no longer
the danger of p_limit being NULL (which is what the original zombie check
was added for).
- When we bump the current processes soft CPU limit in ast(), just bump the
private p_cpulimit instead of the shared rlimit. This fixes an XXX for
some value of fix. There is still a (probably benign) bug in that this
code doesn't check that the new soft limit exceeds the hard limit.

Inspired by: bde (2)


# 104695 09-Oct-2002 julian

Round out the facility for a 'bound' thread to loan out its KSE
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
the owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is conceivable that the
borrower should inherit the priority of the owner too.
That's another discussion and would be simple to do.

Also, as part of this, the "preallocated spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".

DDB now shows a lot more info for threaded processes, though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)

Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.


# 104387 02-Oct-2002 jhb

Rename the mutex thread and process states to use a more generic 'LOCK'
name instead. (e.g., SLOCK instead of SMTX, TD_ON_LOCK() instead of
TD_ON_MUTEX()) Eventually a turnstile abstraction will be added that
will be shared with mutexes and other types of locks. SLOCK/TDI_LOCK will
be used internally by the turnstile code and will not be specific to
mutexes. Making the change now ensures that turnstiles can be dropped
in at a later date without affecting the ABI of userland applications.


# 104354 02-Oct-2002 scottl

Some kernel threads try to do significant work, and the default KSTACK_PAGES
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
whose size can be specified when calling kthread_create. Passing the
value 0 prevents the alternate kstack from being created. Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.

Reviewed by: jake, peter, jhb
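
A usage sketch, assuming the post-commit signature in which the fifth
argument gives the alternate kstack size in pages (0 keeps the default
KSTACK_PAGES stack); names are hypothetical:

#include <sys/param.h>
#include <sys/kthread.h>

static struct proc *deep_kthread;

static void
deep_worker(void *arg)
{
        /* work with call chains too deep for the default kstack */
        kthread_exit(0);
}

static void
start_deep_worker(void)
{
        /* ask for 4 pages of kstack instead of the default */
        kthread_create(deep_worker, NULL, &deep_kthread, 0, 4, "deepworker");
}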


# 104306 01-Oct-2002 jmallett

Back out kernel support for reliable signal queues.

Requested by: rwatson, phk, and many others


# 104245 30-Sep-2002 jmallett

(Forced commit, to clarify previous commit of ksiginfo/signal queue code.)

I've added a structure, kernel-private, to represent a pending or in-delivery
signal, called `ksiginfo'. It is roughly analogous to the basic information
that is exported by the POSIX interface 'siginfo_t', but more basic. I've
added functions to allocate these structures, and further to wrap all signal
operations using them.

Once the operations are wrapped, I've added a TailQ (see queue(3)) of these
structures to 'struct proc', and all pending signals are in that TailQ. When
a signal is being delivered, it is dequeued from the list. Once I finish
the spreading of ksiginfo throughout the tree, the dequeued structure will be
delivered to the process in question, whereas currently and normally, the
signal number is what is used.
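
The shape being described, in sketch form (the field names are
illustrative, not necessarily the committed ones):

#include <sys/queue.h>
#include <sys/signal.h>

struct ksiginfo {
        TAILQ_ENTRY(ksiginfo)   ksi_list;       /* link on the pending queue */
        int                     ksi_signo;      /* signal number */
        int                     ksi_code;       /* cause code */
        union sigval            ksi_value;      /* queued value, if any */
};

/* In struct proc: a queue of pending signals rather than a bare mask. */
TAILQ_HEAD(ksiginfo_head, ksiginfo) p_sigq;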


# 104240 30-Sep-2002 jhb

- Add a new per-process flag PS_XCPU to indicate that at least one thread
has exceeded its CPU time limit.
- In mi_switch(), set PS_XCPU when the CPU time limit is exceeded.
- Perform actual CPU time limit exceeded work in ast() when PS_XCPU is set.

Requested by: many
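
Sketch of the split (illustrative; the limit check itself is
abbreviated into a hypothetical helper):

/* mi_switch(), with sched_lock held: only flag the condition. */
if (proc_over_cpulimit(p))              /* hypothetical limit check */
        p->p_sflag |= PS_XCPU;

/* ast(): perform the actual work outside the switch path. */
if (p->p_sflag & PS_XCPU) {
        mtx_lock_spin(&sched_lock);
        p->p_sflag &= ~PS_XCPU;
        mtx_unlock_spin(&sched_lock);
        killproc(p, "exceeded maximum CPU limit");
}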


# 104233 30-Sep-2002 jmallett

First half of implementation of ksiginfo, signal queues, and such. This
gets signals operating based on a TailQ, and is good enough to run X11,
GNOME, and do job control. There are some intricate parts which could be
more refined to match the sigset_t versions, but those require further
evaluation of directions in which our signal system can expand and contract
to fit our needs.

After this has been in the tree for a while, I will make in kernel API
changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo
more robustly, such that we can actually pass information with our
(queued) signals to the userland. That will also result in using a
struct ksiginfo pointer, rather than a signal number, in a lot of
kern_sig.c, to refer to an individual pending signal queue member, but
right now there is no defined behaviour for such.

CODAFS is unfinished in this regard because the logic is unclear in
some places.

Sponsored by: New Gold Technology
Reviewed by: bde, tjr, jake [an older version, logic similar]


# 104157 29-Sep-2002 julian

Implement basic KSE loaning. This stops a thread that is blocked in BOUND mode
from stopping another thread from completing a syscall, and this allows it to
release its resources etc. Probably more related commits to follow (at least
one I know of)

Initial concept by: julian, dillon
Submitted by: davidxu


# 104103 28-Sep-2002 phk

Make P_MAGIC fit in p_magic.


# 104085 28-Sep-2002 phk

Fix two style problems which made FlexeLint unhappy:

Don't use zero-dimension array in struct pargs.

Comma after the last element of an enum doesn't make sense.


# 104079 28-Sep-2002 julian

Place 'completed thread anchor' in the pre-zero'd section of the KSEGRP
structure, not the copied section.


# 104031 27-Sep-2002 julian

Redo how completing threads pass their state to userland
if they are not going to cross over themselves. Also change how the list of
completed user threads is tracked and passed to the KSE. This is not
a change in design but rather the implementation of what was originally
envisioned.


# 103972 25-Sep-2002 archie

Make the following name changes to KSE related functions, etc., to better
represent their purpose and minimize namespace conflicts:

kse_fn_t -> kse_func_t
struct thread_mailbox -> struct kse_thr_mailbox
thread_interrupt() -> kse_thr_interrupt()
kse_yield() -> kse_release()
kse_new() -> kse_create()

Add missing declaration of kse_thr_interrupt() to <sys/kse.h>.
Regenerate the various generated syscall files. Minor style fixes.

Reviewed by: julian


# 103851 23-Sep-2002 julian

Remove a bunch of stuff that is surplus now


# 103843 23-Sep-2002 julian

White space commit...
prompted by Peter's commit, to fix spaces that should be tabs in #defines.


# 103838 23-Sep-2002 julian

Slightly clean up the thread_userret() and thread_consider_upcall() calls,
also some slight changes for TDF_BOUND testing, and small style changes.
Should ONLY affect KSE programs.

Submitted by: davidxu


# 103411 16-Sep-2002 mini

Add kernel support needed for the KSE-aware libpthread:
- Use ucontext_t's to store KSE thread state.
- Synthesize state for the UTS upon each upcall, rather than
saving and copying a trapframe.
- Save and restore FPU state properly in ucontext_t's.
- Deliver signals to KSE-aware processes via upcall.
- Rename kse mailbox structure fields to be more BSD-like.
- Store the UTS's stack in struct proc in a stack_t.

Reviewed by: bde, deischen, julian
Approved by: -arch


# 103367 15-Sep-2002 julian

Allocate KSEs and KSEGRPs separately and remove them from the proc structure.
The next step is to allow > 1 to be allocated per process. This would give
multi-processor threads. (when the rest of the infrastructure is
in place)

While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc
are diverging more than they should.. corrective action needed soon.


# 103336 15-Sep-2002 jmallett

Make a comment reflect less of a lie: NOCPU is used generally to mean that we
are not on a specific CPU in a distributed system, certainly not just for
<struct proc>.p_oncpu, which is of course now <struct kse>.ke_oncpu.


# 103220 11-Sep-2002 julian

Correct another spammage.. sorry Bruce.. not exactly sure how my patch
reverted out your change, but hopefully that's it..


# 103219 11-Sep-2002 julian

revert a line that was not part of my change..
I think it was a part of someone else's commit that
somehow got reverted by my patch.


# 103217 11-Sep-2002 julian

Comment and whitespace changes.


# 103216 11-Sep-2002 julian

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


# 103182 10-Sep-2002 bde

Fixed namespace pollution in uma changes:
- use `struct uma_zone *' instead of uma_zone_t, so that <sys/uma.h> isn't
a prerequisite.
- don't include <sys/uma.h>.
Namespace pollution makes "opaque" types like uma_zone_t perfectly
non-opaque. Such types should never be used (see style(9)).


# 103049 06-Sep-2002 peter

Zap the implementations of the i386-aout specific cpu_coredump function.
Most of the non-i386 platforms had rather broken implementations anyway.


# 103002 06-Sep-2002 julian

Use UMA as a complex object allocator.
The process allocator now caches and hands out complete process structures
*including substructures* .

i.e. it gets the process structure with the first thread (and soon KSE)
already allocated and attached, all in one hit.

For the average non-threaded program (non-KSE, that is) the allocated thread
and its stack remain attached to the process, even when the process is
unused and in the process cache. This saves having to allocate and attach it
later, effectively bringing us (hopefully) close to the efficiency
of pre-KSE systems where these were a single structure.

Reviewed by: davidxu@freebsd.org, peter@freebsd.org


# 102950 05-Sep-2002 davidxu

s/SGNL/SIG/
s/SNGL/SINGLE/
s/SNGLE/SINGLE/

Fix abbreviations for the P_STOPPED_* etc. flags; in the original code they
were inconsistent and difficult to distinguish.

Approved by: julian (mentor)


# 102898 03-Sep-2002 davidxu

In the kernel code, we have the tsleep() call with the PCATCH argument.
PCATCH means 'if we get a signal, interrupt me!" and tsleep returns
either EINTR or ERESTART depending on the circumstances. ERESTART is
"special" because it causes the system call to fail, but right as it
returns back to userland it tells the trap handler to move %eip back a
bit so that userland will immediately re-run the syscall.
This is a syscall restart. It only works for things like read() etc where
nothing has changed yet. Note that *userland* is tricked into restarting
the syscall by the kernel. The kernel doesn't actually do the restart. It
is deadly for things like select, poll, nanosleep etc where it might cause
the elapsed time to be reset and start again from scratch. So those
syscalls do this to prevent userland rerunning the syscall:
if (error == ERESTART) error = EINTR;

Fake "signals" like SIGTSTP from ^Z etc do not normally invoke userland
signal handlers. But, in -current, the PCATCH *is* being triggered and
tsleep is returning ERESTART, and the syscall is aborted even though no
userland signal handler was run.
That is the fault here. We're triggering the PCATCH in cases that we
shouldn't. ie: it is being triggered on *any* signal processing, rather
than the case where the signal is posted to userland.
--- Peter

The work of psignal() is a patchwork of special cases required by the process
debugging and job-control facilities...
--- Kirk McKusick
"The Design and Implementation of the 4.4BSD Operating System"
Page 105

In the STABLE source, when psignal is posting a STOP signal to a sleeping
process and the signal action of the process is SIG_DFL, the system will
directly change the process state from SSLEEP to SSTOP, and when
SIGCONT is posted to the stopped process, if it finds that the process
is still on the sleep queue, the process state will be restored to SSLEEP,
and it won't wake up the process.

This commit mimics that behaviour of the STABLE source tree.

Reviewed by: Jon Mini, Tim Robbins, Peter Wemm
Approved by: julian@freebsd.org (mentor)
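
The pattern from the explanation above, as a self-contained sketch
(hypothetical kernel wait routine):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>

static int
my_timed_wait(void *chan)
{
        int error;

        error = tsleep(chan, PZERO | PCATCH, "mywait", hz);
        if (error == ERESTART) {
                /*
                 * Time has already elapsed; a transparent restart would
                 * reset the interval, so fail with EINTR instead.
                 */
                error = EINTR;
        }
        return (error);
}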


# 102592 29-Aug-2002 julian

Rejig the code to figure out estcpu and work out how long a KSEGRP has been
idle. What was there before was surprisingly ALMOST correct.

Peter and I fried our brains on this for a couple of hours figuring out
what this actually means in the context of multiple threads.

Reviewed by: peter@freebsd.org


# 102544 28-Aug-2002 peter

updatepri() works on a ksegrp (where the scheduling parameters are), so
directly give it the ksegrp instead of the thread. The only thing it used
to use in the thread was the ksegrp.

Reviewed by: julian


# 102416 25-Aug-2002 julian

Fix a couple of typos in comments.
Submitted by: Jake Burkholder <jake@locore.ca>


# 102396 25-Aug-2002 julian

Reformat some comments to fit in 80 columns and
rewrite some comments that have 'aged' poorly.


# 102327 23-Aug-2002 julian

Add the complex state TDS_SUSP_SLP.

This state is to allow some experimentation and is not YET used.
The theory is that a thread that is about to sleep is placed on the sleep
queue and then discovers it should suspend, and is placed on the suspend queue
(these are separate queues and it can be on both). It will not become runnable
until it has been removed from BOTH queues, i.e. a wakeup event
has occurred AND the process has been unsuspended. If it were not on the sleep
queue when suspended, then the (possibly only) wakeup event might arrive and
not find any process to wake up. This would result in the thread
transition to one of TDS_SLP or TDS_SUSPENDED, depending upon which
constraint is lifted first.


# 101499 08-Aug-2002 julian

EEK! Two status flags in the proc structure were defined to the
same value!

Picked up by: David Xu <bsddiy@yahoo.com>


# 100913 30-Jul-2002 tanimura

- Optimize wakeup() and its friends; if a thread being woken up is
already being swapped in, we do not have to ask the scheduler thread
to do that.

- Assert that a process is not swapped out in runq functions and
swapout().

- Introduce thread_safetoswapout() for readability.

- In swapout_procs(), perform a test that may block (check of a
thread working on its vm map) first. This lets us call swapout()
with the sched_lock held, providing a better atomicity.


# 100884 29-Jul-2002 julian

Create a new thread state to describe threads that would be ready to run
except for the fact that they are presently swapped out. Also add a process
flag to indicate that the process has started the struggle to swap
back in. This will be needed for the case where multiple threads
start the swapin action, to avoid a collision. Also add code to stop
a process from being swapped out if one of the threads in this
process is actually off running on another CPU.. that might hurt...

Submitted by: Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp>


# 100645 24-Jul-2002 julian

Fix a typo
Cleanup a define
Correct locking notes (a bit)
Add a definition and a field for future work on thread suspension.


# 100209 17-Jul-2002 gallatin

Allow alphas to do crashdumps: Refuse to run anything in choosethread()
after a panic which is not an interrupt thread, or the thread which
caused the panic. Also, remove panicstr checks from msleep() and from
cv_wait() in order to allow threads to go to sleep and yield the cpu
to the panicking thread, or to an interrupt thread which might
be doing the crashdump.

Reviewed by: jhb (and it was mostly his idea too)


# 99942 14-Jul-2002 julian

Thinking about it I came to the conclusion that the KSE states were incorrectly
formulated. The correct states should be:
IDLE: On the idle KSE list for that KSEG
RUNQ: Linked onto the system run queue.
THREAD: Attached to a thread and slaved to whatever state the thread is in.

This means that most places where we were adjusting kse state can go away,
as the KSE is just moving around because the thread is.
The only places we need to adjust the KSE state is in transition to and from
the idle and run queues.

Reviewed by: jhb@freebsd.org


# 99890 12-Jul-2002 dillon

Re-enable the idle page-zeroing code. Remove all IPIs from the idle
page-zeroing code as well as from the general page-zeroing code and use a
lazy tlb page invalidation scheme based on a callback made at the end
of mi_switch.

A number of people came up with this idea at the same time so credit
belongs to Peter, John, and Jake as well.

Two-way SMP buildworld -j 5 tests (second run, after stabilization)
2282.76 real 2515.17 user 704.22 sys before peter's IPI commit
2266.69 real 2467.50 user 633.77 sys after peter's commit
2232.80 real 2468.99 user 615.89 sys after this commit

Reviewed by: peter, jhb
Approved by: peter


# 99565 07-Jul-2002 peter

Remove OBE prototype for procrunnable()


# 99072 29-Jun-2002 julian

Part 1 of KSE-III

The ability to schedule multiple threads per process
(on one cpu) by making ALL system calls optionally asynchronous.
To come: ia64 and power-pc patches, patches for gdb, test program (in tools).

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 98768 24-Jun-2002 markm

Fix a GCCism.

int foo[0]; // dodgy
int foo[]; // means the same, works the same and survives lint.

Tested by: 3 months of use on my laptop


# 98765 24-Jun-2002 jake

Add an MD callout like cpu_exit, but which is called after sched_lock is
obtained, when all other scheduling activity is suspended. This is needed
on sparc64 to deactivate the vmspace of the exiting process on all cpus.
Otherwise if another unrelated process gets the exact same vmspace structure
allocated to it (same address), its address space will not be activated
properly. This seems to fix some spontaneous signal 11 problems with smp
on sparc64.


# 97986 07-Jun-2002 jhb

- Add a per-thread member 'td_inktrace' to be used by ktrace to detect
when a thread is in the ktrace subsystem to avoid ktrace'ing internal
ktrace events.
- Update the locking notes for p_traceflag and p_tracep taking into account
the new ktrace_lock mutex.


# 97714 01-Jun-2002 mike

Add POSIX.1-2001 WCONTINUED option for waitpid(2). A proc flag
(P_CONTINUED) is set when a stopped process receives a SIGCONT and
cleared after it has notified a parent process that has requested
notification via waitpid(2) with WCONTINUED specified in its options
operand. The status value can be checked with the new WIFCONTINUED()
macro.

Reviewed by: jake
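
A userland usage sketch (hypothetical program, not from the commit):

#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        int status;
        pid_t pid = fork();

        if (pid == 0) {
                pause();                        /* child waits for signals */
                _exit(0);
        }
        kill(pid, SIGSTOP);
        waitpid(pid, &status, WUNTRACED);       /* reap the stop event */
        if (WIFSTOPPED(status))
                printf("child stopped\n");
        kill(pid, SIGCONT);
        waitpid(pid, &status, WCONTINUED);      /* the new option */
        if (WIFCONTINUED(status))               /* the new macro */
                printf("child continued\n");
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return (0);
}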


# 96886 18-May-2002 jhb

Change p_can{debug,see,sched,signal}()'s first argument to be a thread
pointer instead of a proc pointer and require the process pointed to
by the second argument to be locked. We now use the thread ucred reference
for the credential checks in p_can*() as a result. p_canfoo() should now
no longer need Giant.


# 96354 10-May-2002 jhb

p_leader is only set at fork1() time, so update its locking note
appropriately.


# 96244 09-May-2002 mini

Remove trace_req().

Reviewed by: alfred, jhb, peter


# 95875 01-May-2002 jhb

Axe unused SESS_UNLOCK_NOSWITCH() and PGRP_UNLOCK_NOSWITCH() macros. The
MTX_NOSWITCH flag was deprecated a while ago.


# 95594 27-Apr-2002 iedowse

Avoid the user-visible effect of setting SA_NOCLDWAIT when the
SIGCHLD handler is SIG_IGN. This is a reimplementation of the
problematic revision 1.131 of kern_exit.c. To avoid accessing process
UPAGES, we set a new procsig flag when the SIGCHLD handler is SIG_IGN
and use that instead.
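
The user-visible behaviour in question, sketched as a hypothetical
program:

#include <signal.h>
#include <unistd.h>

int
main(void)
{
        /*
         * With SIGCHLD set to SIG_IGN, children are reaped
         * automatically, as if SA_NOCLDWAIT had been set.
         */
        signal(SIGCHLD, SIG_IGN);
        if (fork() == 0)
                _exit(0);
        sleep(1);               /* no zombie should remain */
        return (0);
}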


# 94857 16-Apr-2002 jhb

- Merge the pgrpsess_lock and proctree_lock sx locks into one proctree_lock
sx lock. Trying to get the lock order between these locks was getting
too complicated as the locking in wait1() was being fixed.
- leavepgrp() now requires an exclusive lock of proctree_lock to be held
when it is called.
- fixjobc() no longer gets a shared lock of proctree_lock now that it
requires an xlock be held by the caller.
- Locking notes in sys/proc.h are adjusted to note that everything that
used to be protected by the pgrpsess_lock is now protected by the
proctree_lock.


# 93793 04-Apr-2002 bde

Moved signal handling and rescheduling from userret() to ast() so that
they aren't in the usual path of execution for syscalls and traps.
The main complication for this is that we have to set flags to control
ast() everywhere that changes the signal mask.

Avoid locking in userret() in most of the remaining cases.

Submitted by: luoqi (first part only, long ago, reorganized by me)
Reminded by: dillon


# 93348 28-Mar-2002 alfred

To remove nested include of sys/lock.h and sys/mutex.h from sys/proc.h
make the pargs_* functions into non-inlines in kern/kern_proc.c.

Requested by: bde


# 93295 27-Mar-2002 alfred

Make the reference counting of 'struct pargs' SMP safe.

There is still some locations where the PROC lock should be held
in order to prevent inconsistent views from outside (like the
proc->p_fd fix for kern/vfs_syscalls.c:checkdirs()) that can be
fixed later.

Submitted by: Jonathan Mini <mini@haikugeek.com>


# 93264 27-Mar-2002 dillon

Compromise for critical*()/cpu_critical*() recommit. Cleanup the interrupt
disablement assumptions in kern_fork.c by adding another API call,
cpu_critical_fork_exit(). Cleanup the td_savecrit field by moving it
from MI to MD. Temporarily move cpu_critical*() from <arch>/include/cpufunc.h
to <arch>/<arch>/critical.c (stage-2 will clean this up).

Implement interrupt deferral for i386 that allows interrupts to remain
enabled inside critical sections. This also fixes an IPI interlock bug,
and requires uses of icu_lock to be enclosed in a true interrupt disablement.

This is the stage-1 commit. Stage-2 will occur after stage-1 has stabilized,
and will move cpu_critical*() into its own header file(s) + other things.
This commit may break non-i386 architectures in trivial ways. This should
be temporary.

Reviewed by: core
Approved by: core


# 92824 20-Mar-2002 jhb

Change the way we ensure td_ucred is NULL if DIAGNOSTIC is defined.
Instead of caching the ucred reference, just go ahead and eat the
decrement and increment of the refcount. Now that Giant is pushed down
into crfree(), we no longer have to get Giant in the common case. In the
case when we are actually free'ing the ucred, we would normally free it on
the next kernel entry, so the cost there is not new, just in a different
place. This also removes td_cache_ucred from struct thread. This is
still only done #ifdef DIAGNOSTIC.

Tested on: i386, alpha


# 92751 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 92719 19-Mar-2002 alfred

Remove __P


# 92654 19-Mar-2002 jeff

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab-like allocator.

Reviewed by: arch@


# 92252 13-Mar-2002 alfred

Fixes to make select/poll mpsafe.

Problem:
selwakeup required calling pfind which would cause lock order
reversals with the allproc_lock and the per-process filedesc lock.
Solution:
Instead of recording the pid of the select()'ing process into the
selinfo structure, actually record a pointer to the thread. To
avoid dereferencing a bad address all the selinfo structures that
are in use by a thread are kept in a list hung off the thread
(protected by sellock). When a selwakeup occurs the selinfo is
removed from that threads list, it is also removed on the way out
of select or poll where the thread will traverse its list removing
all the selinfos from its own list.

Problem:
Previously the PROC_LOCK was used to provide the mutual exclusion
needed to ensure proper locking, this couldn't work because there
was a single condvar used for select and poll and condvars can
only be used with a single mutex.
Solution:
Introduce a global mutex 'sellock' which is used to provide mutual
exclusion when recording events to wait on as well as performing
notification when an event occurs.

Interesting note:
schedlock is required to manipulate the per-thread TDF_SELECT
flag, however if given its own field it would not need schedlock,
also because TDF_SELECT is only manipulated under sellock one
doesn't actually use schedlock for synchronization, only to protect
against corruption.

Proc locks are no longer used in select/poll.

Portions contributed by: davidc
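
Sketch of the new linkage (illustrative field names):

struct selinfo {
        TAILQ_ENTRY(selinfo)    si_thrlist;     /* list hung off the thread */
        struct thread           *si_thread;     /* waiting thread, or NULL */
        short                   si_flags;
};

/*
 * selwakeup(), holding sellock, unhooks the selinfo from si_thread's
 * list and wakes the thread; select/poll purge their own list on the
 * way out.
 */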


# 91140 23-Feb-2002 tanimura

Lock struct pgrp, session and sigio.

New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)


# 91090 22-Feb-2002 julian

Add some DIAGNOSTIC code.
While in userland, keep the thread's ucred reference in a shadow
field so that the usual place to store it is NULL.
If DIAGNOSTIC is not set, the thread ucred is kept valid until the next
kernel entry, at which time it is checked against the process cred
and possibly corrected. Produces a BIG speedup in
kernels with INVARIANTS set. (A previous commit corrected it
for the non INVARIANTS case already)

Reviewed by: dillon@freebsd.org


# 91066 22-Feb-2002 phk

Convert p->p_runtime and PCPU(switchtime) to bintime format.


# 90829 18-Feb-2002 julian

Remove __P() before committing new prototypes with KSE stuff
in a couple of weeks.


# 90640 13-Feb-2002 bde

Fixed sign extension bugs in previous commit. They didn't completely
break scheduling because negative priorities were mostly fixed up by
converting kg_pri_user back to the correct type.

Fixed some style bugs in previous commit (non-terminated sentence fragments
and regressions in comments).


# 90538 11-Feb-2002 julian

In a threaded world, different priorities become properties of
different entities. Make it so.

Reviewed by: jhb@freebsd.org (john baldwin)


# 90361 07-Feb-2002 julian

Pre-KSE/M3 commit.
This is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embedded there
but remove all the code that assumes that, in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 90103 02-Feb-2002 rwatson

o Add the prototype of cr_cansignal() from an earlier change to
kern_prot.c. This has apparently been sitting in my local tree for
ages, and has been generating a warning during the building of
kern_prot.o.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 90033 31-Jan-2002 dillon

GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid buffer
cache lockups for over a year now.

MFC after: 0 days


# 88900 05-Jan-2002 jhb

Change the preemption code for software interrupt thread schedules and
mutex releases to not require flags for the cases when preemption is
not allowed:

The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent
switching to a higher priority thread on mutex release and swi schedule,
respectively when that switch is not safe. Now that the critical section
API maintains a per-thread nesting count, the kernel can easily check
whether or not it should switch without relying on flags from the
programmer. This fixes a few bugs in that all current callers of
swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from
fast interrupt handlers and the swi_sched of softclock needed this flag.
Note that to ensure that swi_sched()'s in clock and fast interrupt
handlers do not switch, these handlers have to be explicitly wrapped
in critical_enter/exit pairs. Presently, just wrapping the handlers is
sufficient, but in the future with the fully preemptive kernel, the
interrupt must be EOI'd before critical_exit() is called. (critical_exit()
can switch due to a deferred preemption in a fully preemptive kernel.)

I've tested the changes to the interrupt code on i386 and alpha. I have
not tested ia64, but the interrupt code is almost identical to the alpha
code, so I expect it will work fine. PowerPC and ARM do not yet have
interrupt code in the tree so they shouldn't be broken. Sparc64 is
broken, but that's been ok'd by jake and tmm who will be fixing the
interrupt code for sparc64 shortly.

Reviewed by: peter
Tested on: i386, alpha


# 88088 17-Dec-2001 jhb

Modify the critical section API as follows:
- The MD functions critical_enter/exit are renamed to start with a cpu_
prefix.
- MI wrapper functions critical_enter/exit maintain a per-thread nesting
count and a per-thread critical section saved state set when entering
a critical section while at nesting level 0 and restored when exiting
to nesting level 0. This moves the saved state out of spin mutexes so
that interlocking spin mutexes works properly.
- Most low-level MD code that used critical_enter/exit now use
cpu_critical_enter/exit. MI code such as device drivers and spin
mutexes use the MI wrappers. Note that since the MI wrappers store
the state in the current thread, they do not have any return values or
arguments.
- mtx_intr_enable() is replaced with a constant CRITICAL_FORK which is
assigned to curthread->td_savecrit during fork_exit().

Tested on: i386, alpha
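
Sketch of the MI wrappers described above (abbreviated, not the
committed source):

void
critical_enter(void)
{
        struct thread *td = curthread;

        if (td->td_critnest == 0)
                td->td_savecrit = cpu_critical_enter();
        td->td_critnest++;
}

void
critical_exit(void)
{
        struct thread *td = curthread;

        if (td->td_critnest == 1) {
                td->td_critnest = 0;
                cpu_critical_exit(td->td_savecrit);
        } else
                td->td_critnest--;
}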


# 87793 13-Dec-2001 jhb

Use a per-thread variable for keeping state when a thread is processing
a KTR log entry. Any KTR requests made while working on an entry are
ignored/discarded to prevent recursion. This is a better fix for the
hack to futz with the CPU mask and call getnanotime() if KTR_LOCK or
KTR_WITNESS was on. It also covers the actual formatting of the log entry
including dumping it to the display which the earlier hacks did not.


# 87702 11-Dec-2001 jhb

Overhaul the per-CPU support a bit:

- The MI portions of struct globaldata have been consolidated into a MI
struct pcpu. The MD per-CPU data are specified via a macro defined in
machine/pcpu.h. A macro was chosen over a struct mdpcpu so that the
interface would be cleaner (PCPU_GET(my_md_field) vs.
PCPU_GET(md.md_my_md_field)).
- All references to globaldata are changed to pcpu instead. In a UP kernel,
this data was stored as global variables which is where the original name
came from. In an SMP world this data is per-CPU and ideally private to each
CPU outside of the context of debuggers. This also included combining
machine/globaldata.h and machine/globals.h into machine/pcpu.h.
- The pointer to the thread using the FPU on i386 was renamed from
npxthread to fpcurthread to be identical with other architectures.
- Make the show pcpu ddb command MI with a MD callout to display MD
fields.
- The globaldata_register() function was renamed to pcpu_init() and now
init's MI fields of a struct pcpu in addition to registering it with
the internal array and list.
- A pcpu_destroy() function was added to remove a struct pcpu from the
internal array and list.

Tested on: alpha, i386
Reviewed by: peter, jake
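
Sketch of the interface (my_md_field is the hypothetical example from
the commit message above):

/* machine/pcpu.h supplies the MD members via a macro: */
#define PCPU_MD_FIELDS \
        int     pc_my_md_field;         /* hypothetical MD member */

/* Consumers use one accessor for both MI and MD fields: */
struct thread *td = PCPU_GET(curthread);        /* MI field */
int v = PCPU_GET(my_md_field);                  /* MD field, same syntax */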


# 87540 08-Dec-2001 des

p_trespass() has been dead for over a year.


# 85750 30-Oct-2001 jhb

Threads sit on condition variable wait queues, not processes.


# 85598 27-Oct-2001 des

Add a P_INEXEC flag that indicates that the process has called execve() and
it has not yet returned. Use this flag to deny debugging requests while
the process is execve()ing, and close once and for all any race conditions
that might occur between execve() and various debugging interfaces.

Reviewed by: jhb, rwatson


# 85525 26-Oct-2001 jhb

Add a per-thread ucred reference for syscalls and synchronous traps from
userland. The per thread ucred reference is immutable and thus needs no
locks to be read. However, until all the proc locking associated with
writes to p_ucred are completed, it is still not safe to use the per-thread
reference.

Tested on: x86 (SMP), alpha, sparc64


# 85446 24-Oct-2001 julian

re-undo rev 1.78 now that style(9) is sane in this regard
(make struct {proc,thread,kse,ksegrp} readable again).


# 85316 22-Oct-2001 des

Upon further reflection, back out previous commit, partly for the reasons
Bruce stated and partly because it introduces gratuitous incompatibilities
with -STABLE.


# 85300 22-Oct-2001 des

Move the stop event macros from pioctl.h to proc.h, and add an S_ALLSTOPS
macro to represent "all stop events".


# 84635 07-Oct-2001 des

These flags aren't just for procfs - in fact, these days they are primarily
used by ptrace(2) - so tweak the accompanying comments a little.


# 84046 27-Sep-2001 jhb

Restore this file to style(9). Mostly consists of whitespace fixes in the
structure definitions. There were some older whitespace bogons as well.


# 83639 18-Sep-2001 rwatson

o Introduce two new calls, securelevel_gt() and securelevel_ge(), which
abstract the securelevel implementation details from the checking
code. The call in -CURRENT accepts a struct ucred--in -STABLE, it
will accept struct proc. This facilitates the upcoming commit of
per-jail securelevel support. The calls will also generate a
kernel printf if the calls are made with NULL ucred/proc pointers:
generally speaking, there are few instances of this, and they should
be fixed.
o Update p_candebug() to use securelevel_gt(); future updates to the
remainder of the kernel tree will be committed soon.

Obtained from: TrustedBSD Project
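
Usage sketch: the calls return 0 or an error, so they drop straight
into existing checks (my_securelevel_check is hypothetical):

static int
my_securelevel_check(struct ucred *cred)
{
        int error;

        /* Fail if the applicable securelevel is greater than 0. */
        error = securelevel_gt(cred, 0);
        if (error)
                return (error);
        return (0);
}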


# 83421 13-Sep-2001 obrien

Re-apply rev 1.178 -- style(9) the structure definitions.
I have to wonder how many other changes were lost in the KSE milestone 2 merge.


# 83419 13-Sep-2001 jhb

Adjust some locking comments.


# 83416 13-Sep-2001 julian

shift a few flags around.. (physically, not logically)


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doozy!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 83276 10-Sep-2001 peter

Rip some well duplicated code out of cpu_wait() and cpu_exit() and move
it to the MI area. KSE touched cpu_wait() which had the same change
replicated five ways for each platform. Now it can just do it once.
The only MD parts seemed to be dealing with fpu state cleanup and things
like vm86 cleanup on x86. The rest was identical.

XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional
stub in place.

Reviewed by: jake, tmm, dillon


# 83045 04-Sep-2001 obrien

style(9) the structure definitions.


# 82611 30-Aug-2001 dillon

Get rid of most of the GIANT_XXX assertion defines. Nobody is going to use
them, including me.


# 82472 28-Aug-2001 rwatson

o Remove P_CAN_* constants, as they are no longer being used.

Obtained from: TrustedBSD Project


# 82085 21-Aug-2001 jhb

- Fix a bug in the previous workaround for the tsleep/endtsleep race.
callout_stop() would fail in two cases:
1) The timeout was currently executing, and
2) The timeout had already executed.
We only needed to work around the race for 1). We caught some instances
of 2) via the PS_TIMEOUT flag, however, if endtsleep() fired after the
process had been woken up but before it had resumed execution,
PS_TIMEOUT would not be set, but callout_stop() would fail, so we
would block the process until endtsleep() resumed it. Except that
endtsleep() had already run and couldn't resume it. This adds a new flag
PS_TIMOFAIL to indicate the case of 2) when PS_TIMEOUT isn't set.
- Implement this race fix for condition variables as well.

Tested by: sos


# 81493 10-Aug-2001 jhb

- Close races with signals and other AST's being triggered while we are in
the process of exiting the kernel. The ast() function now loops as long
as the PS_ASTPENDING or PS_NEEDRESCHED flags are set. It returns with
preemption disabled so that any further AST's that arrive via an
interrupt will be delayed until the low-level MD code returns to user
mode.
- Use u_int's to store the tick counts for profiling purposes so that we
do not need sched_lock just to read p_sticks. This also closes a
problem where the call to addupc_task() could screw up the arithmetic
due to non-atomic reads of p_sticks.
- Axe need_proftick(), aston(), astoff(), astpending(), need_resched(),
clear_resched(), and resched_wanted() in favor of direct bit operations
on p_sflag.
- Fix up locking with sched_lock some. In addupc_intr(), use sched_lock
to ensure pr_addr and pr_ticks are updated atomically with setting
PS_OWEUPC. In ast() we clear pr_ticks atomically with clearing
PS_OWEUPC. We also do not grab the lock just to test a flag.
- Simplify the handling of Giant in ast() slightly.

Reviewed by: bde (mostly)


# 81399 10-Aug-2001 jhb

- Remove asleep(), await(), and M_ASLEEP.
- Callers of asleep() and await() have been converted to calling tsleep().
The only caller outside of M_ASLEEP was the ata driver, which called both
asleep() and await() with spl-raised, so there was no need for the
asleep() and await() pair. M_ASLEEP was unused.

Reviewed by: jasone, peter


# 79335 05-Jul-2001 rwatson

o Replace calls to p_can(..., P_CAN_xxx) with calls to p_canxxx().
The p_can(...) construct was a premature (and, it turns out,
awkward) abstraction. The individual calls to p_canxxx() better
reflect differences between the inter-process authorization checks,
such as differing checks based on the type of signal. This has
a side effect of improving code readability.
o Replace direct credential authorization checks in ktrace() with
invocation of p_candebug(), while maintaining the special case
check of KTR_ROOT. This allows ktrace() to "play more nicely"
with new mandatory access control schemes, as well as making its
authorization checks consistent with other "debugging class"
checks.
o Eliminate "privused" construct for p_can*() calls which allowed the
caller to determine if privilege was required for successful
evaluation of the access control check. This primitive is currently
unused, and as such, serves only to complicate the API.

Approved by: ({procfs,linprocfs} changes) des
Obtained from: TrustedBSD Project


# 79225 04-Jul-2001 dillon

cleanup: GIANT macros, rename DEPRECIATE to DEPRECATE
Move p_giant_optional to proc zero'd section
Remove (old) XXX zfree comment in pipe code


# 79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 79006 30-Jun-2001 jhb

Remove the p_spinlocks spin lock count that was obsoleted by the
per-CPU spinlocks list.


# 78983 29-Jun-2001 jhb

Move ast() and userret() to sys/kern/subr_trap.c now that they are MI.


# 78962 29-Jun-2001 jhb

Add a new MI pointer to the process' trapframe p_frame instead of using
various differently named pointers buried under p_md.

Reviewed by: jake (in principle)


# 78116 11-Jun-2001 des

Say one thing, do the other... nextpid -> lastpid


# 78112 11-Jun-2001 des

Rename nextpid to lastpid and externalize it.


# 77183 25-May-2001 rwatson

o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
the original macro that pointed p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restructure many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparison.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit


# 76649 15-May-2001 jhb

Add a PROC_TRYLOCK() macro to perform a mtx_trylock() on the process lock.
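
A minimal sketch of the intended use, assuming mtx_trylock()'s
nonzero-on-success convention carries through to PROC_TRYLOCK():

    if (PROC_TRYLOCK(p)) {
            /* ... inspect or update *p ... */
            PROC_UNLOCK(p);
    } else {
            /* Lock is contended; back off and retry later. */
    }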


# 76488 11-May-2001 jhb

Add a new macro to test if a process' proc lock is held by the current
thread: PROC_LOCKED().


# 76078 27-Apr-2001 jhb

Overhaul of the SMP code. Several portions of the SMP kernel support have
been made machine independent and various other adjustments have been made
to support Alpha SMP.

- It splits the per-process portions of hardclock() and statclock() off
into hardclock_process() and statclock_process() respectively. hardclock()
and statclock() call the *_process() functions for the current process so
that UP systems will run as before. For SMP systems, it is simply necessary
to ensure that all other processors execute the *_process() functions when the
main clock functions are triggered on one CPU by an interrupt. For the alpha
4100, clock interrupts are delivered in a staggered broadcast fashion, so
we simply call hardclock/statclock on the boot CPU and call the *_process()
functions on the secondaries. For x86, we call statclock and hardclock as
usual and then call forward_hardclock/statclock in the MD code to send an IPI
to cause the AP's to execute forward_hardclock/statclock which then call the
*_process() functions.
- forward_signal() and forward_roundrobin() have been reworked to be MI and to
involve less hackery. Now the cpu doing the forward sets any flags, etc. and
sends a very simple IPI_AST to the other cpu(s). AST IPIs now just basically
return so that they can execute ast() and don't bother with setting the
astpending or needresched flags themselves. This also removes the loop in
forward_signal() as sched_lock closes the race condition that the loop worked
around.
- need_resched(), resched_wanted() and clear_resched() have been changed to take
a process to act on rather than assuming curproc so that they can be used to
implement forward_roundrobin() as described above.
- Various other SMP variables have been moved to a MI subr_smp.c and a new
header sys/smp.h declares MI SMP variables and API's. The IPI API's from
machine/ipl.h have moved to machine/smp.h which is included by sys/smp.h.
- The globaldata_register() and globaldata_find() functions as well as the
SLIST of globaldata structures has become MI and moved into subr_smp.c.
Also, the globaldata list is only available if SMP support is compiled in.

Reviewed by: jake, peter
Looked over by: eivind


# 75739 20-Apr-2001 alfred

Add a comment to note that a process's vmspace may change; so far only
aiod does this, and it is also marked P_SYSTEM. The locations that
reference p->p_vmspace usually do so within the context of the caller,
and the async access from the vm system is protected by the fact that it
will skip over P_SYSTEM processes.

Ok'd by: jhb


# 75631 17-Apr-2001 alfred

Implement client side NFS locks.

Obtained from: BSD/os
Import Ok'd by: mckusick, jkh, motd on builder.freebsd.org


# 75437 12-Apr-2001 rwatson

o Replace p_cankill() with p_cansignal(), remove wrappage of p_can()
from signal authorization checking.
o p_cansignal() takes three arguments: subject process, object process,
and signal number, unlike p_cankill(), which only took into account
the processes and not the signal number, improving the abstraction
such that CANSIGNAL() from kern_sig.c can now also be eliminated;
previously CANSIGNAL() special-cased the handling of SIGCONT based
on process session. privused is now deprecated.
o The new p_cansignal() further limits the set of signals that may
be delivered to processes with P_SUGID set, and restructures the
access control check to allow it to be extended more easily.
o These changes take into account work done by the OpenBSD Project,
as well as by Robert Watson and Thomas Moestl on the TrustedBSD
Project.

Obtained from: TrustedBSD Project
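
A hedged caller sketch, assuming the signature implied above is
int p_cansignal(struct proc *p1, struct proc *p2, int signum); the
wrapper name is hypothetical:

    static int
    may_deliver(struct proc *curp, struct proc *targp, int sig)
    {
            int error;

            /* Subject, object, and the specific signal are all checked. */
            error = p_cansignal(curp, targp, sig);
            if (error)
                    return (error);
            return (0);
    }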


# 74927 28-Mar-2001 jhb

Convert the allproc and proctree locks from lockmgr locks to sx locks.
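
With allproc_lock now an sx lock, readers can share the lock; a minimal
sketch of the common iteration pattern:

    struct proc *p;

    sx_slock(&allproc_lock);                /* shared (read) acquire */
    LIST_FOREACH(p, &allproc, p_list) {
            /* ... read-only inspection of each process ... */
    }
    sx_sunlock(&allproc_lock);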


# 74912 28-Mar-2001 jhb

Rework the witness code to work with sx locks as well as mutexes.
- Introduce lock classes and lock objects. Each lock class specifies a
name and set of flags (or properties) shared by all locks of a given
type. Currently there are three lock classes: spin mutexes, sleep
mutexes, and sx locks. A lock object specifies properties of an
additional lock along with a lock name and all of the extra stuff needed
to make witness work with a given lock. This abstract lock stuff is
defined in sys/lock.h. The lockmgr constants, types, and prototypes have
been moved to sys/lockmgr.h. For temporary backwards compatibility,
sys/lock.h includes sys/lockmgr.h.
- Replace proc->p_spinlocks with a per-CPU list, PCPU(spinlocks), of spin
locks held. By making this per-cpu, we do not have to jump through
magic hoops to deal with sched_lock changing ownership during context
switches.
- Replace proc->p_heldmtx, formerly a list of held sleep mutexes, with
proc->p_sleeplocks, which is a list of held sleep locks including sleep
mutexes and sx locks.
- Add helper macros for logging lock events via the KTR_LOCK KTR logging
level so that the log messages are consistent.
- Add some new flags that can be passed to mtx_init():
- MTX_NOWITNESS - specifies that this lock should be ignored by witness.
This is used for the mutex that blocks a sx lock for example.
- MTX_QUIET - this is not new, but you can pass this to mtx_init() now
and no events will be logged for this lock, so that one doesn't have
to change all the individual mtx_lock/unlock() operations.
- All lock objects maintain an initialized flag. Use this flag to export
a mtx_initialized() macro that can be safely called from drivers. Also,
we no longer walk the all_mtx list if MUTEX_DEBUG is defined, as witness
performs the corresponding checks using the initialized flag.
- The lock order reversal messages have been improved to output slightly
more accurate file and line numbers.
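
A sketch of the new flags in use, assuming the three-argument
mtx_init(lock, description, flags) form of this era; the lock name is
hypothetical:

    static struct mtx backing_mtx;

    /* Witness ignores this lock and no KTR events are logged for it. */
    mtx_init(&backing_mtx, "sx backing lock",
        MTX_DEF | MTX_NOWITNESS | MTX_QUIET);

    /* Drivers may now test for initialization safely. */
    if (!mtx_initialized(&backing_mtx))
            panic("backing_mtx not initialized");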


# 74904 28-Mar-2001 jhb

- Fix a whitespace bogon with p_blocked.
- Move p_intr_nesting_level, p_aioinfo, and p_ithd into the zero'd area
so that we don't have to explicitly zero them during fork().


# 74016 09-Mar-2001 jhb

Fix mtx_legal2block. The only time that it is bad to block on a mutex is
if we hold a spin mutex, since we can trivially get into deadlocks if we
start switching out of processes that hold spinlocks. Checking to see if
interrupts were disabled was a sort of cheap way of doing this since most
of the time interrupts were only disabled when holding a spin lock. At
least on the i386. To fix this properly, use a per-process counter
p_spinlocks that counts the number of spin locks currently held, and
instead of checking to see if interrupts are disabled in the witness code,
check to see if we hold any spin locks. Since child processes always
start up with the sched lock magically held in fork_exit(), we initialize
p_spinlocks to 1 for child processes. Note that proc0 doesn't go through
fork_exit(), so it starts with no spin locks held.

Consulting from: cp


# 73904 06-Mar-2001 jhb

- In the locking key for struct proc, generalize the '+' symbol to mean
that write access to a member requires both locks and read access only
requires one of the given locks. Convert instances of '(c+)' to
'(c + k)' as a result.
- Change p_pptr from (e) to (c + e).
- Change p_oppid from (c) to (c + e).
- Change p_args from (b?) to (c + k).
- Move the actual work of STOPEVENT, PHOLD, and PRELE to _STOPEVENT,
_PHOLD, and _PRELE. The new macros do not acquire the proc lock and
simply assert that it is held. The non _ prefixed macros acquire the
proc lock and then call the _ prefixed macros.
- Add a PROC_LOCK_NOSWITCH() macro to be used when releasing the proc lock
while already holding a spin lock (usually sched_lock).
- Add a PROC_LOCK_ASSERT() macro to be used to make assertions about the
proc lock. It takes the usual mtx_assert() macro arguments as its
second argument.
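
A sketch of the resulting idiom:

    PROC_LOCK(p);
    _PHOLD(p);                      /* lock already held, use the _ form */
    PROC_LOCK_ASSERT(p, MA_OWNED);  /* asserts rather than acquires */
    /* ... work that requires the process to stay resident ... */
    _PRELE(p);
    PROC_UNLOCK(p);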


# 72786 21-Feb-2001 rwatson

o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.

Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project
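
A minimal sketch of the credential-based check replacing direct pointer
tests; the policy decision shown is hypothetical:

    /* Before: p->p_prison != NULL; now ask the subject credential. */
    if (jailed(p->p_ucred))
            return (EPERM);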


# 72746 20-Feb-2001 jhb

- Don't call clear_resched() in userret(), instead, clear the resched flag
in mi_switch() just before calling cpu_switch() so that the first switch
after a resched request will satisfy the request.
- While I'm at it, move a few things into mi_switch() and out of
cpu_switch(), specifically set the p_oncpu and p_lastcpu members of
proc in mi_switch(), and handle the sched_lock state change across a
context switch in mi_switch().
- Since cpu_switch() no longer handles the sched_lock state change, we
have to setup an initial state for sched_lock in fork_exit() before we
release it.


# 72683 19-Feb-2001 bde

Changed the aston() family to operate on a specified process instead of
always on curproc. This is needed to implement signal delivery properly
(see a future log message for kern_sig.c).

Debogotified the definition of aston(). aston() was defined in terms
of signotify() (perhaps because only the latter already operated on
a specified process), but aston() is the primitive.

Similar changes are needed in the ia64 versions of cpu.h and trap.c.
I didn't make them because the ia64 is missing the prerequisite changes
to make astpending and need_resched per-process and those changes are
too large to make without testing.


# 72376 11-Feb-2001 jake

Implement a unified run queue and adjust priority levels accordingly.

- All processes go into the same array of queues, with different
scheduling classes using different portions of the array. This
allows user processes to have their priorities propagated up into
interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
32. We used to have 4 separate arrays of 32 queues each, so this
may not be optimal. The new run queue code was written with this
in mind; changing the number of run queues only requires changing
constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This
is intended to be used to create per-cpu run queues. Implement
wrappers for compatibility with the old interface which pass in
the global run queue structure.
- Group the priority level, user priority, native priority (before
propagation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
This was used to detect when a process' priority had lowered and
it should yield. We now effectively yield on every interrupt.
- Activate propagate_priority(). It should now have the desired
effect without needing to also propagate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
idle loop. It interfered with propagate_priority() because
the idle process needed to do a non-blocking acquire of Giant
and then other processes would try to propagate their priority
onto it. The idle process should not do anything except idle.
vm_page_zero_idle() will return in the form of an idle priority
kernel thread which is woken up at appropriate times by the vm
system.
- Update struct kinfo_proc to the new priority interface. Deliberately
change its size by adjusting the spare fields. It remained the same
size, but the layout has changed, so userland processes that use it
would parse the data incorrectly. The size constraint should really
be changed to an arbitrary version number. Also add a debug.sizeof
sysctl node for struct kinfo_proc.


# 72276 10-Feb-2001 jhb

- Make astpending and need_resched process attributes rather than CPU
attributes. This is needed for AST's to be properly posted in a preemptive
kernel. They are backed by two new flags in p_sflag: PS_ASTPENDING and
PS_NEEDRESCHED. They are still accessed by their old macros:
aston(), astoff(), etc. For completeness, an astpending() macro has been
added to check for a pending AST, and clear_resched() has been added to
clear need_resched().
- Rename syscall2() on the x86 back to syscall() to be consistent with
other architectures.
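
A sketch of how the flag-backed macros described in the first item
plausibly expand (locking of p_sflag omitted; exact definitions may
differ):

    #define aston(p)          ((p)->p_sflag |= PS_ASTPENDING)
    #define astoff(p)         ((p)->p_sflag &= ~PS_ASTPENDING)
    #define astpending(p)     ((p)->p_sflag & PS_ASTPENDING)
    #define need_resched(p)   ((p)->p_sflag |= PS_NEEDRESCHED)
    #define clear_resched(p)  ((p)->p_sflag &= ~PS_NEEDRESCHED)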


# 72237 09-Feb-2001 jhb

- Move struct ithd to sys/interrupt.h.
- Add a set of MI helper functions for interrupt threads:
- ithread_create() creates a new interrupt thread
- ithread_destroy() destroys an interrupt thread
- ithread_add_handler() attaches a new handler to an interrupt thread
- ithread_remove_handler() detaches a handler from an interrupt thread
- Rename sinthand_add() and sched_swi() to swi_add() and swi_sched()
respectively so that they live in a consistent namespace.
- struct intrhand is no longer a public type. It would be private to
kern_intr.c but the current implementation of fast interrupts on the
alpha requires the type to be exported. However, all handlers should
be treated as void * cookies in the way that new-bus treats them. This
includes references to software interrupt handlers.


# 72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarly, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
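
The new calling convention in brief (foo_mtx is a placeholder):

    mtx_lock(&foo_mtx);             /* MTX_DEF (sleep) mutex */
    /* ... critical section ... */
    mtx_unlock(&foo_mtx);

    mtx_lock_spin(&sched_lock);     /* MTX_SPIN mutex */
    /* ... */
    mtx_unlock_spin(&sched_lock);

    /* The two surviving flags go through the explicit wrappers: */
    mtx_lock_flags(&foo_mtx, MTX_QUIET);
    mtx_unlock_flags(&foo_mtx, MTX_QUIET);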


# 71858 31-Jan-2001 peter

Zap last remaining references to (and a use of) simple_locks.


# 71815 29-Jan-2001 jhb

- Use the right name for the proctree lock in the locking key.
- Add a note about the special locking semantics used for members such as
p_cred that are read by multiple processes but only written to by the
current process.
- Change p_upages_obj's locking key to note that it is created at fork
and left alone afterwards (the actual pointer, not what it points to.)
- Mark p_intr_nesting_level as being implicitly locked since only curproc
accesses it.

Reviewed by: jake


# 71696 26-Jan-2001 jhb

Fix fork_exit() to take a pointer to a function that returns void as its
first argument rather than a function that returns a void *.

Noticed by: jake
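
Together with r71604 below, the resulting prototype reads roughly:

    void    fork_exit(void (*callout)(void *, struct trapframe *),
                void *arg, struct trapframe *frame);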


# 71604 24-Jan-2001 jhb

- Change fork_exit() to take a pointer to a trapframe as its 3rd argument
instead of a trapframe directly. (Requested by bde.)
- Convert the alpha switch_trampoline to call fork_exit() and use the MI
fork_return() instead of child_return().
- Axe child_return().


# 71520 24-Jan-2001 jhb

- Split p_flag up into two fields. p_flag keeps most of the previous flags
and is protected by the proc lock. p_sflag is protected by sched_lock
and holds the following flags: PS_INMEM, PS_OWEUPC, PS_PROFIL, PS_SINTR,
PS_TIMEOUT, PS_ALRMPEND, PS_PROFPEND, PS_CVWAITQ, PS_SWAPINREQ, and
PS_SWAPPING.
- p_klist is definitely locked now by the proc lock.
- p_runtime, p_[usi]u are locked by sched_lock.
- Add a new P_KTHREAD flag set for kernel threads created via
kthread_create(9).
- STOPEVENT() only needs the proc lock, it does not need Giant.
- faultin() already checks PS_INMEM, so simplify the check in PHOLD() so
that we only need to grab the proc lock and let faultin() perform the
PS_INMEM check.
- Add a prototype for zpfind().
- Add prototypes for the new fork_exit() and fork_return() MI functions
that manage the fork return path.
- Add a prototype for the MD function userret() so that it can be called
from fork_return().
- Add needed include of <machine/frame.h> in the kernel.


# 71337 21-Jan-2001 jake

Make intr_nesting_level per-process, rather than per-cpu. Setup
interrupt threads to run with it always >= 1, so that malloc can
detect M_WAITOK from "interrupt" context. This is also necessary
in order to context switch from sched_ithd() directly.

Reviewed By: peter


# 71318 21-Jan-2001 jake

Remove the per-cpu pages used for copy and zero-ing pages of memory
for SMP; just use the same ones as UP. These weren't used without
holding Giant anyway, and the routines that use them would have to
be protected from pre-emption to avoid migrating cpus.


# 71088 15-Jan-2001 jasone

Implement condition variables.
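
A minimal sketch of the condition-variable idiom, paired with a mutex
as condvar(9) requires; the names are illustrative:

    struct cv done_cv;
    struct mtx done_mtx;
    int done;                       /* predicate protected by done_mtx */

    cv_init(&done_cv, "done");

    /* Waiter: re-check the predicate in a loop. */
    mtx_lock(&done_mtx);
    while (!done)
            cv_wait(&done_cv, &done_mtx);
    mtx_unlock(&done_mtx);

    /* Signaller: */
    mtx_lock(&done_mtx);
    done = 1;
    cv_signal(&done_cv);
    mtx_unlock(&done_mtx);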


# 70317 23-Dec-2000 jake

Protect proc.p_pptr and proc.p_children/p_sibling with the
proctree_lock.

linprocfs not locked pending response from informal maintainer.

Reviewed by: jhb, -smp@


# 70037 14-Dec-2000 jhb

Locking change: lock p_cred with the proc mutex (c) instead of assuming
that it doesn't change after process creation.


# 70036 14-Dec-2000 jhb

Whitespace cleanups, and a few comment changes to get everything to fit
back in 80 columns again after the locking descriptions were added.

Submitted by: bde


# 69947 12-Dec-2000 jake

- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr. Also provides macros for the flags
passed to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.


# 69735 07-Dec-2000 jhb

Add comments to the proc structure to describe how each member will be
locked. This list is subject to change, but hopefully many changes will
not have to be made.


# 69654 06-Dec-2000 jhb

- Add in PROC_LOCK() and PROC_UNLOCK() macros. For now these do simple
mutex operations. In the future they may call functions that verify
correct locking order between processes in the INVARIANTS case.
- Lock the process in PHOLD() and PRELE().


# 69636 05-Dec-2000 jhb

1. Several style cleanups:
- Whitespace fixes.
- Comment fixes (mostly capitalization and trailing periods).
- Reorderings to put variables all in the same place, function prototypes
sorted alphabetically, etc.
2. Removed unused and #ifdef'd out members from struct ithd.
3. Changed ESTCPULIM() to use an explicit (PRIO_MAX - PRIO_MIN) rather than
PRIO_TOTAL.
4. Remove <sys/lock.h> #include and resort #include's.

Submitted by: bde (1, 3, 4)


# 69542 03-Dec-2000 jhb

Fix up a whitespace glitch in PHOLD() and fix it to use do { ... } while(0)
instead of { ... }.
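
The do { ... } while (0) form matters because a bare { ... } breaks when
the macro is used as a single statement before an else:

    /*
     * With do/while(0), PHOLD(p); expands to exactly one statement.
     * With bare braces, the trailing semicolon would end the if and
     * orphan the else.
     */
    if (cond)
            PHOLD(p);
    else
            something_else();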


# 69537 02-Dec-2000 jhb

- Add a mutex to the proc structure p_mtx that will be used to lock accesses
to each individual proc.
- Initialize the lock during fork1(), and destroy it in wait1().


# 69512 02-Dec-2000 jake

Remove thr_sleep and thr_wakeup. Remove fields p_nthread and p_wakeup
from struct proc, which are now unused (p_nthread already was).
Remove process flag P_KTHREADP which was untested and only set
in vfs_aio.c (it should use kthread_create). Move the yield
system call to kern_synch.c as kern_threads.c has been removed
completely.

moral support from: alfred, jhb


# 69379 30-Nov-2000 marcel

Don't use p->p_sigstk.ss_flags to keep state of whether the
process is on the alternate stack or not. For compatibility
with sigstack(2), state is still being updated when needed.

We now determine whether the process is on the alternate
stack by looking at its stack pointer. This allows a process
to siglongjmp from a signal handler on the alternate stack
to the place of the sigsetjmp on the normal stack. When
maintaining state, this would have invalidated the state
information, causing a subsequent signal to be delivered
on the normal stack instead of the alternate stack.

PR: 22286


# 69367 29-Nov-2000 jhb

- Add a p_mtxname field to proc which points to the description of the
mutex a process is blocked on or NULL.
- Add a corresponding field e_mtxname to eproc.
- Fix a spelling nit in a comment.


# 69286 27-Nov-2000 jake

Use callout_reset instead of timeout(9). Most callouts are statically
allocated, 2 have been added to struct proc for setitimer and sleep.

Reviewed by: jhb, jlemon
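
A sketch of the replacement, assuming the callout API of this era
(callout_init() then callout_reset(c, ticks, func, arg)); the handler
name is illustrative:

    struct callout sleep_ch;        /* e.g. embedded in struct proc */

    callout_init(&sleep_ch);
    /* Fire sleep_timeout() after timo ticks. */
    callout_reset(&sleep_ch, timo, sleep_timeout, p);
    /* ... */
    callout_stop(&sleep_ch);        /* cancel if the sleep ends early */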


# 69022 22-Nov-2000 jake

Protect the following with a lockmgr lock:

allproc
zombproc
pidhashtbl
proc.p_list
proc.p_hash
nextpid

Reviewed by: jhb
Obtained from: BSD/OS and netbsd


# 68862 17-Nov-2000 jake

- Split the run queue and sleep queue linkage, so that a process
may block on a mutex while on the sleep queue without corrupting
it.
- Move dropping of Giant to after the acquire of sched_lock.

Tested by: John Hay <jhay@icomtek.csir.co.za>
jhb


# 67893 29-Oct-2000 phk

Move suser() and suser_xxx() prototypes and a related #define from
<sys/proc.h> to <sys/systm.h>.

Correctly document the #includes needed in the manpage.

Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.


# 67694 27-Oct-2000 bde

Declare or #define per-cpu globals in <machine/globals.h> in all cases.
The i386 UP case was messily different.


# 67551 25-Oct-2000 jhb

- Overhaul the software interrupt code to use interrupt threads for each
type of software interrupt. Roughly, what used to be a bit in spending
now maps to a swi thread. Each thread can have multiple handlers, just
like a hardware interrupt thread.
- Instead of using a bitmask of pending interrupts, we schedule the specific
software interrupt thread to run, so spending, NSWI, and the shandlers
array are no longer needed. We can now have an arbitrary number of
software interrupt threads. When you register a software interrupt
thread via sinthand_add(), you get back a struct intrhand that you pass
to sched_swi() when you wish to schedule your swi thread to run.
- Convert the name of 'struct intrec' to 'struct intrhand' as it is a bit
more intuitive. Also, prefix all the members of struct intrhand with
'ih_'.
- Make swi_net() a MI function since there is now no point in it being
MD.

Submitted by: cp


# 67321 19-Oct-2000 jhb

- Move the prototype for proc_reparent from sys/ptrace.h to sys/proc.h
- Fix the sorting of function prototypes in sys/proc.h


# 67308 19-Oct-2000 jhb

Axe the idle_event eventhandler, and add a MD cpu_idle function used
for things such as halting CPU's, idling CPU's, etc.

Discussed with: msmith


# 67014 12-Oct-2000 bde

Moved the declaration of astpending to the correct place.

This shouldn't affect the alpha or ia64, since they don't have a
variable named astpending. The alpha still has 2 declarations of
this nonexistent variable.


# 66754 06-Oct-2000 jhb

Argh, make P_ALRMPEND and P_PROFPEND be different flags.

Submitted by: bde
Pointy hat to: jhb


# 66716 06-Oct-2000 jhb

- Change fast interrupts on x86 to push a full interrupt frame and to
return through doreti to handle ast's. This is necessary for the
clock interrupts to work properly.
- Change the clock interrupts on the x86 to be fast instead of threaded.
This is needed because both hardclock() and statclock() need to run in
the context of the current process, not in a separate thread context.
- Kill the prevproc hack as it is no longer needed.
- We really need Giant when we call psignal(), but we don't want to block
during the clock interrupt. Instead, use two p_flag's in the proc struct
to mark the current process as having a pending SIGVTALRM or a SIGPROF
and let them be delivered during ast() when hardclock() has finished
running.
- Remove CLKF_BASEPRI, which was #ifdef'd out on the x86 anyways. It was
broken on the x86 if it was turned on since cpl is gone. Its only use
was to bogusly run softclock() directly during hardclock() rather than
scheduling an SWI.
- Remove the COM_LOCK simplelock and replace it with a clock_lock spin
mutex. Since the spin mutex already handles disabling/restoring
interrupts appropriately, this also lets us axe all the *_intr() fu.
- Back out the hacks in the APIC_IO x86 cpu_initclocks() code to use
temporary fast interrupts for the APIC trial.
- Add two new process flags P_ALRMPEND and P_PROFPEND to mark the pending
signals in hardclock() that are to be delivered in ast().

Submitted by: jakeb (making statclock safe in a fast interrupt)
Submitted by: cp (concept of delaying signals until ast())


# 66698 05-Oct-2000 jhb

- Heavyweight interrupt threads on the alpha for device I/O interrupts.
- Make softinterrupts (SWI's) almost completely MI, and divorce them
completely from the x86 hardware interrupt code.
- The ihandlers array is now gone. Instead, there is a MI shandlers array
that just contains SWI handlers.
- Most of the former machine/ipl.h files have moved to a new sys/ipl.h.
- Stub out all the spl*() functions on all architectures.

Submitted by: dfr


# 65904 15-Sep-2000 jhb

- Add a new process flag P_NOLOAD that marks a process that should be
ignored during load average calculations.
- Set this flag for the idle processes and the softinterrupt process.


# 65822 13-Sep-2000 jhb

- Remove the inthand2_t type and use the equivalent driver_intr_t type from
newbus for referencing device interrupt handlers.
- Move the 'struct intrec' type which describes interrupt sources into
sys/interrupt.h instead of making it just be a x86 structure.
- Don't create 'ithd' and 'intrec' typedefs, instead, just use 'struct ithd'
and 'struct intrec'
- Move the code to translate new-bus interrupt flags into an interrupt thread
priority out of the x86 nexus code and into a MI ithread_priority()
function in sys/kern/kern_intr.c.
- Remove now-unneeded x86-specific headers from sys/dev/ata/ata-all.c and
sys/pci/pci_compat.c.


# 65557 06-Sep-2000 jasone

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 65495 05-Sep-2000 truckman

Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().


# 65237 30-Aug-2000 rwatson

o Centralize inter-process access control, introducing:

int p_can(p1, p2, operation, privused)

which allows specification of subject process, object process,
inter-process operation, and an optional call-by-reference privused
flag, allowing the caller to determine if privilege was required
for the call to succeed. This allows jail, kern.ps_showallprocs and
regular credential-based interaction checks to occur in one block of
code. Possible operations are P_CAN_SEE, P_CAN_SCHED, P_CAN_KILL,
and P_CAN_DEBUG. p_can currently breaks out as a wrapper to a
series of static function checks in kern_prot, which should not
be invoked directly.

o Commented out capabilities entries are included for some checks.

o Update most inter-process authorization to make use of p_can() instead
of manual checks, PRISON_CHECK(), P_TRESPASS(), and
kern.ps_showallprocs.

o Modify suser{,_xxx} to use const arguments, as it no longer modifies
process flags due to the disabling of ASU.

o Modify some checks/errors in procfs so that ENOENT is returned instead
of ESRCH, further improving concealment of processes that should not
be visible to other processes. Also introduce new access checks to
improve hiding of processes for procfs_lookup(), procfs_getattr(),
procfs_readdir(). Correct a bug reported by bp concerning not
handling the CREATE case in procfs_lookup(). Remove volatile flag in
procfs that caused apparently spurious qualifier warnings (approved by
bde).

o Add comment noting that ktrace() has not been updated, as its access
control checks are different from ptrace(), whereas they should
probably be the same. Further discussion should happen on this topic.

Reviewed by: bde, green, phk, freebsd-security, others
Approved by: bde
Obtained from: TrustedBSD Project
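
A hedged caller sketch using the signature given above (privused is
assumed to be an int flag):

    int error, privused;

    /* May p1 see p2? privused reports whether privilege was needed. */
    error = p_can(p1, p2, P_CAN_SEE, &privused);
    if (error)
            return (error);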


# 65198 29-Aug-2000 green

Remove any possibility of hiwat-related race conditions by changing
the chgsbsize() call to use a "subject" pointer (&sb.sb_hiwat) and
a u_long target to set it to. The whole thing is splnet().

This fixes a problem that jdp has been able to provoke.


# 62976 11-Jul-2000 mckusick

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot, so for brevity and timeliness it
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
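
A sketch of the gating in use, assuming the single-argument forms of
this era:

    /* Quiesce the filesystem, capture the snapshot, resume writes. */
    vfs_write_suspend(mp);
    /* ... filesystem is now sync'd to disk and idle ... */
    vfs_write_resume(mp);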


# 61976 22-Jun-2000 alfred

fix races in the uidinfo subsystem, several problems existed:

1) while allocating a uidinfo struct malloc is called with M_WAITOK,
it's possible that while asleep another process by the same user
could have woken up earlier and inserted an entry into the uid
hash table. Having redundant entries causes inconsistencies that
we can't handle.

fix: do a non-waiting malloc, and if that fails then do a blocking
malloc, after waking up check that no one else has inserted an entry
for us already.

2) Because many checks for sbsize were done as "test then set" in a non
atomic manner it was possible to exceed the limits put up via races.

fix: instead of querying the count then setting, we just attempt to
set the count and leave it up to the function to return success or
failure.

3) The uidinfo code was inlining and repeating, lookups and insertions
and deletions needed to be in their own functions for clarity.

Reviewed by: green
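
A sketch of fix 1, the non-waiting-then-blocking allocation with a
post-sleep recheck; uilookup() and the surrounding names are
illustrative:

    struct uidinfo *uip, *existing;

    uip = malloc(sizeof(*uip), M_UIDINFO, M_NOWAIT);
    if (uip == NULL) {
            /* May sleep: another process can insert while we do. */
            uip = malloc(sizeof(*uip), M_UIDINFO, M_WAITOK);
            if ((existing = uilookup(uid)) != NULL) {
                    /* Lost the race; discard ours and use theirs. */
                    free(uip, M_UIDINFO);
                    uip = existing;
            }
    }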


# 60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 59792 30-Apr-2000 green

Change the scheduler to actually respect the PUSER barrier. It's been
wrong for many years that negative niceness would lower the priority
of a process below PUSER, and once below PUSER, there were conditionals
in the code that are required to test for whether a process was in
the kernel which would break.

The breakage could (and did) cause lock-ups, basically nothing else
but the least nice program being able to run in some conditions. The
algorithm which adjusts the priority now subtracts PRIO_MIN to do
things properly, and the ESTCPULIM() algorithm was updated to use
PRIO_TOTAL (PRIO_MAX - PRIO_MIN) to calculate the estcpu.

NICE_WEIGHT is now 1 to accommodate the full range of priorities better
(a -20 process with full CPU time has the priority of a +0 process with
no CPU time). There are now 20 queues (exactly; 80 priorities) for
use in user processes' scheduling, and PUSER has been lowered to 48
to accomplish this.

This means, to the user, that things will be scheduled more correctly
(noticeable), there is no lock-up anymore WRT a niced -20 process
never releasing the CPU time for other processes. In this fair system,
tsleep()ed < PUSER processes now will get the proper higher priority
than priority >= PUSER user processes.

The detective work of this was done by me, along with part of the
solution. Luoqi Chen has provided most of the solution, and really
helped me understand what was happening better, to boot :)

Submitted by: luoqi
Concept reviewed by: bde


# 59288 16-Apr-2000 jlemon

Introduce kqueue() and kevent(), a kernel event notification facility.
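
A minimal userland sketch: register interest in a descriptor, then wait
for events:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    struct kevent change, event;
    int kq = kqueue();

    /* Watch fd for readability. */
    EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    if (kevent(kq, &change, 1, &event, 1, NULL) > 0) {
            /* event.ident is readable; event.data bytes are pending. */
    }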


# 58755 28-Mar-2000 dillon

The SMP cleanup commit broke UP compiles. Make UP compiles work again.


# 58717 28-Mar-2000 dillon

Commit major SMP cleanups and move the BGL (big giant lock) in the
syscall path inward. A system call may select whether it needs the MP
lock or not (the default being that it does need it).

A great deal of conditional SMP code for various dead-ended experiments
has been removed. 'cil' and 'cml' have been removed entirely, and the
locking around the cpl has been removed. The conditional
separately-locked fast-interrupt code has been removed, meaning that
interrupts must hold the CPL now (but they pretty much had to anyway).
Another reason for doing this is that the original separate-lock for
interrupts just doesn't apply to the interrupt thread mechanism being
contemplated.

Modifications to the cpl may now ONLY occur while holding the MP
lock. For example, if an otherwise MP safe syscall needs to mess with
the cpl, it must hold the MP lock for the duration and must (as usual)
save/restore the cpl in a nested fashion.

This is precursor work for the real meat coming later: avoiding having
to hold the MP lock for common syscalls and I/O's and interrupt threads.
It is expected that the spl mechanisms and new interrupt threading
mechanisms will be able to run in tandem, allowing a slow piecemeal
transition to occur.

This patch should result in a moderate performance improvement due to
the considerable amount of code that has been removed from the critical
path, especially the simplification of the spl*() calls. The real
performance gains will come later.

Approved by: jkh
Reviewed by: current, bde (exception.s)
Some work taken from: luoqi's patch


# 56766 28-Jan-2000 green

Fix a bug that could crash the system if you press ^T while a slower
system is slowed down and in the right spot (a race condition in fork()).

The "previous time" fields have moved from pstat to proc. Anything which
uses KVM needs to be recompiled with a new libkvm/headers.

A couple wacky u_quad_t's in struct proc are now u_int64_t (the same, but
according to lack of 'quad's in proc.h and usage in kern_resource.c).
This will have no effect on code.

This has been make-world-and-installed-new-kernel-which-works-fine-tested.

Reviewed by: bde (previous version)


# 55205 29-Dec-1999 peter

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistent with the other
BSD's who made this change quite some time ago. More commits to come.


# 54471 12-Dec-1999 green

Move the wakeup_one() prototype from proc.h to systm.h. It now hangs
out with its sibling, wakeup().


# 54188 06-Dec-1999 luoqi

User ldt sharing.


# 53823 28-Nov-1999 bde

Scheduler fixes equivalent to the ones logged in the following NetBSD
commit to kern_synch.c:

----------------------------
revision 1.55
date: 1999/02/23 02:56:03; author: ross; state: Exp; lines: +39 -10
Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclk() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclk() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock. Define these for alpha, where hz==1024, and nice was
totally broke.

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
be incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this, it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
----------------------------

The details are a little different in FreeBSD:

=== nice bug === Fixing this is the main point of this commit. We use
essentially the same clipping rule as NetBSD (our limit on p_estcpu
differs by a scale factor). However, clipping at all is fundamentally
bad. It gives free CPU the hoggiest hogs once they reach the limit, and
reaching the limit is normal for long-running hogs. This will be fixed
later.

=== New schedclk() mechanism === We don't use the NetBSD schedclk()
(now schedclock()) mechanism. We require (real)stathz to be about 128
and scale by an extra factor of 2 compared with NetBSD's statclock().
We scale p_estcpu instead of scaling the clock. This is more accurate
and flexible.

=== Algorithm change === Same change.

=== Other bugs === The p_pctcpu bug was fixed long ago. We don't try as
hard to abstract functionality yet.

Related changes: the new limit on p_estcpu must be exported to kern_exit.c
for clipping in wait1().

Agreed with by: dufault


# 53745 27-Nov-1999 bde

Moved scheduling-related code to kern_synch.c so that it is easier to fix
and extend. The new function containing the code is named schedclock()
as in NetBSD, but it has slightly different semantics (it already handles
incrementation of p->p_cpticks, and it should handle any calling frequency).

Agreed with in principle by: dufault


# 53709 26-Nov-1999 phk

Add a sysctl to control if argv is disclosed to the world:
kern.ps_argsopen
It defaults to 1 which means that all users can see all argvs in ps(1).

Reviewed by: Warner


# 53518 21-Nov-1999 phk

Introduce the new function
p_trespass(struct proc *p1, struct proc *p2)
which returns zero or an errno depending on the legality of p1 trespassing
on p2.

Replace kern_sig.c:CANSIGNAL() with call to p_trespass() and one
extra signal related check.

Replace procfs.h:CHECKIO() macros with calls to p_trespass().

Only show command lines to process which can trespass on the target
process.


# 53239 16-Nov-1999 phk

Introduce commandline caching in the kernel.

This fixes some nasty procfs problems for SMP, makes ps(1) run much faster,
and makes ps(1) even less dependent on /proc which will aid chroot and
jails alike.

To disable this facility and revert to previous behaviour:
sysctl -w kern.ps_arg_cache_limit=0

For full details see the current@FreeBSD.org mail-archives.


# 52140 11-Oct-1999 luoqi

Add a per-signal flag to mark handlers registered with osigaction, so we
can provide the correct context to each signal handler.

Fix broken sigsuspend(): don't use p_oldsigmask as a flag, use SAS_OLDMASK
as we did before the linuxthreads support merge (submitted by bde).

Move ps_sigstk from to p_sigacts to the main proc structure since signal
stack should not be shared among threads.

Move SAS_OLDMASK and SAS_ALTSTACK flags from sigacts::ps_flags to proc::p_flag.
Move PS_NOCLDSTOP and PS_NOCLDWAIT flags from proc::p_flag to procsig::ps_flag.

Reviewed by: marcel, jdp, bde


# 52070 09-Oct-1999 green

Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total
usage limit.


# 51791 29-Sep-1999 marcel

sigset_t change (part 2 of 5)
-----------------------------

The core of the signalling code has been rewritten to operate
on the new sigset_t. No methodological changes have been made.
Most references to a sigset_t object are through macros (see
signalvar.h) to create a level of abstraction and to provide
a basis for further improvements.

The NSIG constant has not been changed to reflect the maximum
number of signals possible. The reason is that it breaks
programs (especially shells) which assume that all signals
have a non-null name in sys_signame. See src/bin/sh/trap.c
for an example. Instead _SIG_MAXSIG has been introduced to
hold the maximum signal possible with the new sigset_t.

struct sigprop has been moved from signalvar.h to kern_sig.c
because a) it is only used there, and b) access must be done
through function sigprop(). The latter because the table doesn't
hold properties for all signals, but only for the first NSIG
signals.

signal.h has been reorganized to make reading easier and to
add the new and/or modified structures. The "old" structures
are moved to signalvar.h to prevent namespace pollution.

Especially the coda filesystem suffers from the change, because
it contained lines like (p->p_sigmask == SIGIO), which is easy
to do for integral types, but not for compound types.

NOTE: kdump (and port linux_kdump) must be recompiled.

Thanks to Garrett Wollman and Daniel Eischen for pressing the
importance of changing sigreturn as well.
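
A sketch of the macro-based style this introduces (macro names as in
signalvar.h; treat the exact spellings as illustrative):

    sigset_t set;

    SIGEMPTYSET(set);               /* instead of set = 0 */
    SIGADDSET(set, SIGIO);          /* instead of set |= sigmask(SIGIO) */
    if (SIGISMEMBER(p->p_sigmask, SIGIO))   /* instead of == or & tests */
            /* ... */;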


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 50030 18-Aug-1999 peter

Update for MI switch components. struct prochd is replaced by TAILQ's.
Use a spare pad field for saving the run queue index.


# 48544 03-Jul-1999 mckusick

The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truly empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).

Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.

The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.

reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occurred when write_behind was turned off in the system.

The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.

A small race condition was fixed in getpbuf() in vm/vm_pager.c.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 48391 01-Jul-1999 peter

Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.
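
A sketch of the NetBSD-style interface, assuming a signature along the
lines of int kthread_create(void (*func)(void *), void *arg,
struct proc **newpp, const char *fmt, ...); all names here are
illustrative:

    static void
    nfsiod_loop(void *arg)
    {
            for (;;) {
                    /* ... service NFS I/O requests ... */
            }
    }

    /* Spawn an iod on demand instead of via SYSINIT_KT(). */
    error = kthread_create(nfsiod_loop, NULL, &p, "nfsiod %d", num);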


# 48377 30-Jun-1999 peter

Slight tweak to fork1() calling conventions. Add a third argument so
the caller can easily find the child proc struct. fork(), rfork() etc
syscalls set p->p_retval[] themselves. Simplify the SYSINIT_KT() code
and other kernel thread creators to not need to use pfind() to find the
child based on the pid. While here, partly tidy up some of the fork1()
code for RF_SIGSHARE etc.


# 48304 28-Jun-1999 peter

Move struct prochd out of #ifdef KERNEL so the Alpha genassym can get
at it.


# 46155 28-Apr-1999 phk

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more usable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no provisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


# 46129 27-Apr-1999 luoqi

Enable vmspace sharing on SMP. Major changes are,
- %fs register is added to trapframe and saved/restored upon kernel entry/exit.
- Per-cpu pages are no longer mapped at the same virtual address.
- Each cpu now has a separate gdt selector table. A new segment selector
is added to point to per-cpu pages, per-cpu global variables are now
accessed through this new selector (%fs). The selectors in gdt table are
rearranged for cache line optimization.
- fast_vfork is now on by default for both UP and SMP.
- Some aio code cleanup.

Reviewed by: Alan Cox <alc@cs.rice.edu>
John Dyson <dyson@iquest.net>
Julian Elischer <julian@whistel.com>
Bruce Evans <bde@zeta.org.au>
David Greenman <dg@root.com>


# 46112 27-Apr-1999 phk

Suser() simplification:

1:
s/suser/suser_xxx/

2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.


# 45959 23-Apr-1999 dt

Moved cpu_set_fork_handler's prototype from <machine/cpu.h> to <sys/proc.h>.

Suggested by: bde


# 45665 13-Apr-1999 peter

Move the declaration of faultin() from the vm headers to proc.h, since
it is now referenced from a macro there (PHOLD()).


# 45368 06-Apr-1999 peter

Remove (but leave place markers) P_NOSWAP and P_PHYSIO - they were only
used for preventing swapouts of the UPAGES and there is another mechanism
for that (PHOLD/PRELE using p->p_lock).


# 44679 12-Mar-1999 julian

Reviewed by: Many at different times in different parts,
including alan, john, me, luoqi, and kirk
Submitted by: Matt Dillon <dillon@frebsd.org>

This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existence: figuring
out when getnewbuf() needs to sleep.

The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch is delineated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)


# 44487 05-Mar-1999 bde

The magic "no-cpu" cpu number is 0xff. Don't misrepresent cpu
numbers as chars or use bogus casts in an attempt to unmisrepresent
them. In top, don't assume that 0xff is the only negative cpu
number when cpu numbers are (mis)represented.


# 44452 03-Mar-1999 julian

The tunable parameter for the scheduler quantum was inverted.
Higher numbers led to smaller quanta.
In discussion with BDE, change this parameter to be in uSecs
to make it machine independent,
and limit it to non zero multiples of 'tick' (rounding down).
Also make the variable globally available so that the present function that
returns its value (used for posix scheduling I believe) can go away.

Submitted by: Bruce Evans <bde@freebsd.org>


# 44327 28-Feb-1999 bde

Removed all traces of `p_switchtime'. The relevant timestamp is per-cpu,
not per-process. Keep it in `switchtime' consistently.

It is now clear that the timestamp is always valid in fork_trampoline()
except when the child is running on a previously idle cpu, which
can only happen if there are multiple cpus, so don't check or set
the timestamp in fork_trampoline except in the (i386) SMP case.
Just remove the alpha code for setting it unconditionally, since
there is no SMP case for alpha and the code had rotted.

Parts reviewed by: dfr, phk


# 44269 25-Feb-1999 newton

Added p_emuldata to struct proc as a place for emulators to hook
process-specific state information.

Submitted by: Guido van Rooij <guido@gvr.org>


# 44218 22-Feb-1999 bde

Improved scheduling in uiomove(), etc. resched_wanted() is true too
often for it to be a good criterion for switching kernel cpu hogs --
it is true after most wakeups. Use the criterion "has been running
for >= 2 quanta" instead.
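
Inside a long copy loop the replacement test boils down to roughly
this (an illustrative sketch; the variable and helper names are
invented, not the committed code):

    /* yield only after this process has run for two full quanta */
    if (ticks - started >= 2 * quantum_ticks) {
            maybe_yield();          /* invented helper: switch out */
            started = ticks;
    }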


# 43560 03-Feb-1999 bde

Removed some unused includes (only 1 to go here now). Sorted includes.


# 43497 01-Feb-1999 newton

Declare M_ZOMBIE so the svr4 emulator doesn't need to.

Suggested by: bde


# 43208 26-Jan-1999 julian

Enable Linux threads support by default.
This takes the conditionals out of the code that has been tested by
various people for a while.
ps and friends (libkvm) will need a recompile as some proc structure
changes are made.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# 42379 07-Jan-1999 julian

Changes to the LINUX_THREADS support to only allocate extra memory for
shared signal handling when there is shared signal handling being
used.

This removes the main objection to making the shared signal handling
a standard ability in rfork() and friends and 'unconditionalising'
this code. (i.e. the allocation of an extra 328 bytes per process).

Signal handling information remains in the U area until such time as
its reference count would be incremented to > 1. At that point a new
struct is malloc'd and maintained in KVM so that it can be shared between
the processes (threads) using it.

A function to check the reference count and move the struct back to the U
area when it drops back to 1 is also supplied. Signal information is
therefore now swappable for all processes that are not sharing that
information with other processes. This should address the concerns raised
by Garrett and others.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# 42202 31-Dec-1998 bde

Removed garbage sloppy-common variable `pasleep'. Fixed other style
bugs in previous commit.


# 41971 21-Dec-1998 dillon

Add asleep() and await() support. Currently highly experimental. A
small support structure had to be added to the proc structure, and
a few minor conditional panics no longer apply.
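
The split interface arms a sleep first and commits to it later, so a
caller can drop locks in between; roughly (a sketch of the interface
as described, with the exact argument details hedged):

    asleep(chan, PVM, "vmwait", 0); /* arm: record ident, prio, wmesg */
    /* ... drop locks, finish work; a wakeup(chan) here disarms ... */
    await(-1, 0);                   /* sleep now, if still armed */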


# 41931 19-Dec-1998 julian

Reviewed by: Luoqi Chen, Jordan Hubbard
Submitted by: "Richard Seaman, Jr." <lists@tar.com>
Obtained from: linux :-)

Code to allow Linux Threads to run under FreeBSD.

Not enabled by default.
This code is dependent on the conditional
COMPAT_LINUX_THREADS (suggested by Garrett).
This is not yet a 'real' option but will be within some number of hours.


# 41136 13-Nov-1998 dg

Increased PID_MAX to 99999. The main reason for doing this is to make the
pid space somewhat more sparse which improves the performance of finding
an unused pid on systems with large numbers of processes. The new value
was chosen so that it doesn't overflow the 5 digit pid fields in various
programs.


# 41087 11-Nov-1998 truckman

I got another batch of suggestions for cosmetic changes from bde.


# 41086 11-Nov-1998 truckman

Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR: kern/7899
Reviewed by: bde, elvind
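
For reference, F_SETOWN nominates the process (or, negated, the
process group) that receives SIGIO/SIGURG for a descriptor; typical
userland use looks like this sketch:

    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* arrange for SIGIO/SIGURG from fd to be sent to this process */
    static void
    claim_sigio(int fd)
    {
            if (fcntl(fd, F_SETOWN, getpid()) == -1)
                    err(1, "fcntl(F_SETOWN)");
    }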


# 41038 09-Nov-1998 truckman

If the session leader dies, s_leader is set to NULL and getsid() may
dereference a NULL pointer, causing a panic. Instead of following
s_leader to find the session id, store it in the session structure.

Jukka found the following info:

BTW - I just found what I have been looking for. Std 1003.1
Part 1: SYSTEM API [C LANGUAGE] section 2.2.2.80 states quite
explicitly...

Session lifetime: The period between when a session is created
and the end of lifetime of all the process groups that remain
as members of the session.

So, this quite clearly tells that while there is any single
process in any process group which is a member of the session,
the session remains as an independent entity.

Reviewed by: peter
Submitted by: "Jukka A. Ukkonen" <jau@jau.tmt.tele.fi>


# 36441 28-May-1998 phk

Some cleanups related to timecounters and weird ifdefs in <sys/time.h>.

Clean up (or if antipodic: down) some of the msgbuf stuff.

Use an inline function rather than a macro for timecounter delta.

Maintain process "on-cpu" time as 64 bits of microseconds to avoid
needless second rollover overhead.

Avoid calling microuptime the second time in mi_switch() if we do
not pass through _idle in cpu_switch().

This should reduce our context-switch overhead a bit, in particular
on pre-P5 and SMP systems.

WARNING: Programs which muck about with struct proc in userland
will have to be fixed.

Reviewed, but found imperfect by: bde
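
With a single 64-bit microsecond accumulator, the switch path only
ever adds a delta; schematically (a sketch, with an invented helper
for reading the clock):

    u_int64_t p_runtime;    /* on-cpu time, in microseconds */

    /* at context switch */
    new_switchtime = read_uptime_usec();    /* invented helper */
    p->p_runtime += new_switchtime - switchtime;
    switchtime = new_switchtime;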


# 35029 04-Apr-1998 phk

Time changes mark 2:

* Figure out UTC relative to boottime. Four new functions provide
time relative to boottime.

* move "runtime" into struct proc. This helps fix the calcru()
problem in SMP.

* kill mono_time.

* add timespec{add|sub|cmp} macros to time.h. (XXX: These may change!
See the sketch below.)

* nanosleep, select & poll take long sleeps one day at a time

Reviewed by: bde
Tested by: ache and others
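
The macros are the usual normalize-after-arithmetic helpers;
approximately (a sketch of the flavor added, not necessarily the
final form, which the commit itself flags as subject to change):

    #define timespecadd(vvp, uvp)                                   \
            do {                                                    \
                    (vvp)->tv_sec += (uvp)->tv_sec;                 \
                    (vvp)->tv_nsec += (uvp)->tv_nsec;               \
                    if ((vvp)->tv_nsec >= 1000000000) {             \
                            (vvp)->tv_sec++;                        \
                            (vvp)->tv_nsec -= 1000000000;           \
                    }                                               \
            } while (0)

    #define timespeccmp(tvp, uvp, cmp)                              \
            (((tvp)->tv_sec == (uvp)->tv_sec) ?                     \
                ((tvp)->tv_nsec cmp (uvp)->tv_nsec) :               \
                ((tvp)->tv_sec cmp (uvp)->tv_sec))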


# 34924 28-Mar-1998 bde

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


# 34030 04-Mar-1998 dufault

Reviewed by: msmith, bde long ago
POSIX.4 headers and sysctl variables. Nothing should change
unless POSIX4 is defined or _POSIX_VERSION is set to 199309.


# 33680 20-Feb-1998 bde

Staticized.

Don't depend on "implicit int".


# 32702 22-Jan-1998 dyson

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to the upages structure after being freed.
Struct vmspace continues to point to the pte object and kva space for the kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM; this
allows us to avoid unneeded restart overhead during traversals where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 31891 20-Dec-1997 sef

Clear the p_stops field on change of user/group id, unless the correct
flag is set in the p_pfsflags field. This, essentially, prevents an SUID
program from hanging after being traced. (E.g., "truss /usr/bin/rlogin" would
fail, but leave rlogin in a stopevent state.) Yet another case where procctl
is (hopefully ;)) no longer needed in the general case.

Reviewed by: bde (thanks bruce :))


# 31675 12-Dec-1997 dyson

We have had support for running the kernel daemons as threads for
quite a while, but forgot to do so. For now, this code supports
most daemons running as kernel threads in UP kernels, and as
full processes in SMP. We will soon be able to run them as
threads in SMP, but not yet.


# 31564 06-Dec-1997 sef

Changes to allow event-based process monitoring and control.


# 31403 25-Nov-1997 julian

Shift a few SYSINIT() calls around.
This results in a few functions becoming static, and
the SYSINITs being close to the code they are related to.
Setting up the dump device is with dumpsys(), and
kicking off the scheduler is with the scheduler.
Mounting root is with the code that does it.

Reviewed by: phk


# 31393 24-Nov-1997 bde

Removed all traces of P_IDLEPROC. It was tested but never set.


# 31333 21-Nov-1997 bde

Const poisoning from ks_shortdesc.


# 30994 06-Nov-1997 phk

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warnings, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
to be recompiled.
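
Schematically, at a syscall site (a sketch; p_retval is the field
name as it later stood in the tree):

    /* before: every syscall took an explicit third parameter */
    int     some_syscall(struct proc *p, struct some_args *uap, int retval[]);

    /* after: the return-value slot lives in the proc itself */
    p->p_retval[0] = value;
    return (0);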


# 30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of something :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 29683 21-Sep-1997 gibbs

buf.h:
Change the definition of a buffer queue so that bufqdisksort can
properly deal with ordered writes.

Add inline functions for accessing buffer queues. This should be
considered an opaque data structure by clients.

callout.h:
New callout implementation.

device.h:
Add support for CAM interrupts.

disk.h:
disklabel.h:
tqdisksort->bufqdisksort

kernel.h:
Add new configuration entries for configuration hooks and calling
cpu_rootconf and cpu_dumpconf.

param.h:
Add a priority for sleeping waiting on config hooks.

proc.h:
Update for new callout implementation.

queue.h:
Add TAILQ_HEAD_INITIALIZER from NetBSD (see the example below).

systm.h:
Add prototypes for cpu_root/dumpconf, splcam, splsoftcam, etc..
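
TAILQ_HEAD_INITIALIZER permits static initialization of a queue head,
avoiding a run-time TAILQ_INIT(); for example:

    #include <sys/queue.h>

    struct hook {
            TAILQ_ENTRY(hook) link;
            void (*fn)(void *);
    };

    /* statically initialized; no run-time TAILQ_INIT() needed */
    static TAILQ_HEAD(hooklist, hook) hooks = TAILQ_HEAD_INITIALIZER(hooks);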


# 29340 13-Sep-1997 joerg

Implement SA_NOCLDWAIT.

The implementation is done (unlike what i've originally been
contemplating) by reparenting kids of processes that have the
appropriate bit set to PID 1, and let PID 1 handle the zombie. This
is far less problematical than what would seem to be ``doing it
right'', for a number of reasons.

Of our currently shipping PID-1-intended programs, 50 % fail the above
assumption. ;-) (Read this: sysinstall doesn't do it right. This is
no problem as long as no program called by sysinstall actually uses
SA_NOCLDWAIT.)

ToDo: . clarify the correct SA_* flag inheritance, compared
to other systems,
. decide whether the compat cruft (osigvec(9)) should
deal with new system additions or not,
. merge OpenBSD's SA_SIGINFO implementation. ;)
Reviewed by: bde
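
From userland, the behaviour is requested through sigaction(2); for
example (a sketch):

    #include <signal.h>
    #include <string.h>

    /* ask the kernel not to keep zombies for our children */
    static void
    no_zombies(void)
    {
            struct sigaction sa;

            memset(&sa, 0, sizeof(sa));
            sa.sa_handler = SIG_DFL;
            sa.sa_flags = SA_NOCLDWAIT;
            sigaction(SIGCHLD, &sa, NULL);
    }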


# 27221 06-Jul-1997 dyson

This is an upgrade so that the kernel supports the AIO calls from
POSIX.4. Additionally, there is some initial code that supports LIO.
This code supports AIO/LIO for all types of file descriptors, with
few if any restrictions. There will be a followup very soon that
will support significantly more efficient operation for VCHR type
files (raw.) This code is also dependent on some kernel features
that don't work under SMP yet. After I commit the changes to the
kernel to support proper address space sharing on SMP, this code
will also work under SMP.


# 26812 22-Jun-1997 peter

Preliminary support for per-cpu data pages.

This eliminates a lot of #ifdef SMP type code. Things like _curproc reside
in a data page that is unique on each cpu, eliminating the expensive macros
like: #define curproc (SMPcurproc[cpunumber()])

There are some unresolved bootstrap and address space sharing issues at
present, but Steve is waiting on this for other work. There is still some
strictly temporary code present that isn't exactly pretty.

This is part of a larger change that has run into some bumps, this part is
standalone so it should be safe. The temporary code goes away when the
full idle cpu support is finished.

Reviewed by: fsmp, dyson


# 26671 15-Jun-1997 dyson

Modifications to existing files to support the initial AIO/LIO and
kernel based threading support.


# 26332 01-Jun-1997 peter

New field: p_sleepend; so that settime() can adjust the expected wakeup
time for things like nanosleep. These sleep in terms of "ticks" and
calculate the elapsed time relative to the expected wakeup time and do
not return good results when the system time is adjusted.


# 25998 22-May-1997 phk

Remove p_selbits and p_selbits_size so sizeof(struct proc) returns to 256
bytes. Move p_hash to where it should have been all along, since we
change the layout anyway.

Recompile ps, top, libutil and all that.

Talked about with: bde


# 25543 07-May-1997 peter

Remove #include "opt_smp.h".
The declaration of SMPcurproc[] moved to here, next to its
uniprocessor counterpart.


# 25164 26-Apr-1997 peter

Man the liferafts! Here comes the long awaited SMP -> -current merge!

There are various options documented in i386/conf/LINT, there is more to
come over the next few days.

The kernel should run pretty much "as before" without the options to
activate SMP mode.

There are a handful of known "loose ends" that need to be fixed, but
have been put off since the SMP kernel is in a moderately good condition
at the moment.

This commit is the result of the tinkering and testing over the last 14
months by many people. A special thanks to Steve Passe for implementing
the APIC code!


# 24697 07-Apr-1997 peter

Move p_vmspace into the startzero section since we've just changed things
and may as well get it over and done with.


# 24691 07-Apr-1997 peter

The biggie: Get rid of the UPAGES from the top of the per-process address
space. (!)

Have each process use the kernel stack and pcb in the kvm space. Since
the stacks are at a different address, we cannot copy the stack at fork()
and allow the child to return up through the function call tree to return
to user mode - create a new execution context and have the new process
begin executing from cpu_switch() and go to user mode directly.
In theory this should speed up fork a bit.

Context switch the tss_esp0 pointer in the common tss. This is a lot
simpler than switching the gdt[GPROC0_SEL].sd.sd_base pointer to
each process's tss, since the esp0 pointer is a 32-bit pointer, and the
sd_base setting is split into three different bit sections at non-aligned
boundaries and requires a lot of twiddling to reset.

The 8K of memory at the top of the process space is now empty, and unmapped
(and unmappable, it's higher than VM_MAXUSER_ADDRESS).

Simplify the pmap code that manages process contexts; we no longer have to
double map the UPAGES, which simplifies and should measurably speed up fork().

The following parts came from John Dyson:

Set PG_G on the UPAGES that are now in kernel context, and invalidate
them when swapping them out.

Move the upages object (upobj) from the vmspace to the proc structure.

Now that the UPAGES (pcb and kernel stack) are out of user space, make
rfork(..RFMEM..) do what was intended by sharing the vmspace
entirely via reference counting rather than simply inheriting the mappings.


# 23328 03-Mar-1997 ache

Use roundup(MAXLOGNAME, sizeof(long)) as e_login/s_login arrays size
instead of all hardcoded assumptions historically used
(i.e. sizeof(long) == 4)

Use MAXLOGNAME == 17 for stricter setlogin() size checking. Since
it rounds up to 20, all sizes remain the same
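
roundup() comes from <sys/param.h>, so the declarations look like
this sketch:

    /* from <sys/param.h>: round x up to the next multiple of y */
    #define roundup(x, y)   ((((x) + ((y) - 1)) / (y)) * (y))

    char    e_login[roundup(MAXLOGNAME, sizeof(long))];
    /* with MAXLOGNAME == 17 and sizeof(long) == 4, still 20 bytes */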


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 18974 17-Oct-1996 dyson

Make processes that have been woken up eligible for immediate swap-in.


# 18884 12-Oct-1996 bde

Moved declarations of tsleep() and wakeup() from proc.h to systm.h so
that proc.h doesn't have to be included so often.


# 18277 13-Sep-1996 bde

Don't use __dead in the kernel. It was an obfuscation for gcc >= 2.5
and a no-op for gcc >= 2.6.


# 17702 20-Aug-1996 smpatel

Remove the kernel FD_SETSIZE limit for select().
Make select()'s first argument 'int' not 'u_int'.

Reviewed by: bde


# 17366 31-Jul-1996 dg

Converted timer/run queues to 4.4BSD queue style. Removed old and unused
sleep(). Implemented wakeup_one() which may be used in the future to combat
the "thundering herd" problem for some special cases.

Reviewed by: dyson
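
Where many sleepers contend for one resource and only one can make
progress, the releaser can wake a single process instead of the whole
herd; for example (a sketch):

    /* each waiter blocks on the resource's address */
    tsleep(&resource, PRIBIO, "reswait", 0);

    /* the releaser wakes exactly one waiter */
    wakeup_one(&resource);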


# 16256 09-Jun-1996 alex

Include <sys/param.h> needed for MAXCOMLEN and MAXLOGNAME constants.


# 15495 01-May-1996 bde

Removed prototype for obsolete function sleep().


# 15113 07-Apr-1996 bde

systm.h:
Moved declaration of bootverbose to a better place. It isn't
machine-dependent.

proc.h:
Moved declaration of cpu_fork() to a better place. Only its
implementation is machine-dependent.


# 14530 11-Mar-1996 hsu

Merge in Lite2: LIST changes
stylistic changes to function prototypes
Note: sizeof(struct proc) is now exactly 256 bytes, no room to spare.


# 14513 11-Mar-1996 hsu

From Lite2: moved acct_process() prototype over to acct.h.
Reviewed by: davidg & bde


# 13765 30-Jan-1996 mpp

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


# 13606 24-Jan-1996 peter

proc.h: Add PHOLD()/PRELE() macros to ensure the U area is not swapped
while doing ptrace/procfs in it.
ptrace.h: uncomment PT_ATTACH/PT_DETACH, I am committing code to do this.


# 9507 13-Jul-1995 dg

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there was constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, an un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although it will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Their marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 6583 20-Feb-1995 dg

Added missing extern declaration for 'maxprocperuid'.


# 5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 4506 15-Nov-1994 bde

Include <sys/time.h> so that we compile again provided <sys/param.h> is
included first (rtprio stuff broke this).

Add missing __dead2's.


# 4438 13-Nov-1994 dg

Added P_SWAPPING flag to implement a lock for swap in. It was possible for
a process to be chosen for swap-in while it was being swapped-out. This was
BAD.

Submitted by: John Dyson


# 3484 09-Oct-1994 phk

Cosmetics (sort of). Added 19 prototypes.


# 3304 02-Oct-1994 phk

Prototypes, prototypes and even more prototypes. Not quite done yet, but
getting closer all the time.


# 3297 02-Oct-1994 dg

Include rtprio.h for struct rtprio.


# 3291 02-Oct-1994 dg

"idle priority" support. Based on code from Henrik Vestergaard Draboel,
but substantially rewritten by me.


# 2441 01-Sep-1994 dg

Realtime priority scheduling support.

Submitted by: Henrik Vestergaard Draboel


# 2360 28-Aug-1994 bde

Reformat recent additions to the same style as the original.


# 2283 25-Aug-1994 wollman

Methinks p_sysent should be in the copy-on-fork rather than the
zero-on-fork section of the struct proc. (At least, my system
won't boot unless it is.)


# 2256 24-Aug-1994 sos

Changes preparing for iBCS support

Reviewed by:
Submitted by:


# 2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources