History log of /freebsd-10.0-release/sys/nfsclient/nfs_bio.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 249623 18-Apr-2013 rmacklem

Both NFS clients can deadlock when using the "rdirplus" mount
option. This can occur when an nfsiod thread that already holds
a buffer lock attempts to acquire a vnode lock on an entry in
the directory (a LOR) when another thread holding the vnode lock
is waiting on an nfsiod thread. This patch avoids the deadlock by disabling
readahead for this case, so the nfsiod threads never do readdirplus.
Since readaheads for directories need the directory offset cookie
from the previous read, they cannot normally happen in parallel.
As such, testing by jhb@ and myself didn't find any performance
degredation when this patch is applied. If there is a case where
this results in a significant performance degradation, mounting
without the "rdirplus" option can be done to re-enable readahead
for directories.

Reported and tested by: jhb
Reviewed by: jhb
MFC after: 2 weeks


# 248500 19-Mar-2013 emaste

Fix remainder calculation when biosize is not a power of 2

In common configurations biosize is a power of two, but is not required to
be so. Thanks to markj@ for spotting an additional case beyond my original
patch.

Reviewed by: rmacklem@


# 248255 13-Mar-2013 jhb

Revert 195703 and 195821 as this special stop handling in NFS is now
implemented via VFCF_SBDRY rather than passing PBDRY to individual
sleep calls.


# 248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 239246 14-Aug-2012 kib

Do not leave invalid pages in the object after the short read for a
network file systems (not only NFS proper). Short reads cause pages
other then the requested one, which were not filled by read response,
to stay invalid.

Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As result, not requested invalid pages are
freed even if the read RPC indicated success.

Noted and reviewed by: alc
MFC after: 1 week


# 239065 05-Aug-2012 kib

After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by: alc
MFC after: 2 weeks


# 239040 04-Aug-2012 kib

Reduce code duplication and exposure of direct access to struct
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other then the requested one, for VOP_GETPAGES().

Reviewed by: alc
MFC after: 1 week


# 235332 12-May-2012 rmacklem

PR# 165923 reported intermittent write failures for dirty
memory mapped pages being written back on an NFS mount.
Since any thread can call VOP_PUTPAGES() to write back a
dirty page, the credentials of that thread may not have
write access to the file on an NFS server. (Often the uid
is 0, which may be mapped to "nobody" in the NFS server.)
Although there is no completely correct fix for this
(NFS servers check access on every write RPC instead of at
open/mmap time), this patch avoids the common cases by
holding onto a credential that recently opened the file
for writing and uses that credential for the write RPCs
being done by VOP_PUTPAGES() for both NFS clients.

Tested by: Joel Ray Holveck (joelh at juniper.net)
PR: kern/165923
Reviewed by: kib
MFC after: 2 weeks


# 234605 23-Apr-2012 trasz

Remove unused thread argument from vtruncbuf().

Reviewed by: kib


# 232327 01-Mar-2012 rmacklem

Fix the NFS clients so that they use copyin() instead of bcopy(),
when doing direct I/O. This direct I/O code is not enabled by default.

Submitted by: kib (earlier version)
Reviewed by: kib
MFC after: 1 week


# 231949 20-Feb-2012 kib

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


# 230605 27-Jan-2012 rmacklem

A problem with respect to data read through the buffer cache for both
NFS clients was reported to freebsd-fs@ under the subject "NFS
corruption in recent HEAD" on Nov. 26, 2011. This problem occurred when
a TCP mounted root fs was changed to using UDP. I believe that this
problem was caused by the change in mnt_stat.f_iosize that occurred
because rsize was decreased to the maximum supported by UDP. This
patch fixes the problem by using v_bufobj.bo_bsize instead of f_iosize,
since the latter is set to f_iosize when the vnode is allocated, but
does not change for a given vnode when f_iosize changes.

Reported by: pjd
Reviewed by: kib
MFC after: 2 weeks


# 228156 30-Nov-2011 kib

Rename vm_page_set_valid() to vm_page_set_valid_range().
The vm_page_set_valid() is the most reasonable name for the m->valid
accessor.

Reviewed by: attilio, alc


# 224733 09-Aug-2011 jhb

Merge 220876, 220877, and 221537 from the new NFS client to the old:
Allow the NFS client to use a max file size larger than 1TB for v3 mounts.
It now allows files up to OFF_MAX subject to whatever limit the server
advertises.

Reviewed by: rmacklem
Approved by: re (kib)
MFC after: 1 week


# 222586 01-Jun-2011 kib

In the VOP_PUTPAGES() implementations, change the default error from
VM_PAGER_AGAIN to VM_PAGER_ERROR for the uwritten pages. Return
VM_PAGER_AGAIN for the partially written page. Always forward at least
one page in the loop of vm_object_page_clean().

VM_PAGER_ERROR causes the page reactivation and does not clear the
page dirty state, so the write is not lost.

The change fixes an infinite loop in vm_object_page_clean() when the
filesystem returns permanent errors for some page writes.

Reported and tested by: gavin
Reviewed by: alc, rmacklem
MFC after: 1 week


# 221543 06-May-2011 rmacklem

Move sys/nfsclient/nfs_kdtrace.h to sys/nfs/nfs_kdtrace.h so
it can be used by the new NFS client as well as the old one.


# 214026 18-Oct-2010 kib

Do not synchronously start the nfsiod threads at all. The r212506
fixed the issues with file descriptor locks, but the same problems are
present for vnode lock/user map lock.

If the nfs_asyncio() cannot find the free nfsiod, schedule task to
create new nfsiod and return error. This causes fall back to the
synchronous i/o for nfs_strategy(), or does not start read at all in
the case of readahead. The caller that holds vnode and potentially
user map lock does not wait for kproc_create() to finish, preventing
the LORs.

The change effectively reverts r203072, because we never hand off the
request to newly created nfsiod thread anymore.

Reviewed by: jhb
Tested by: jhb, pluknet
MFC after: 3 weeks


# 209120 13-Jun-2010 kib

In NFS clients, instead of inconsistently using #ifdef
DIAGNOSTIC and #ifndef DIAGNOSTIC for debug assertions, prefer
KASSERT(). Also change one #ifdef DIAGNOSTIC in the new nfs server.

Submitted by: Mikolaj Golub <to.my.trociny gmail com>
MFC after: 2 weeks


# 207746 07-May-2010 alc

Push down the page queues lock into vm_page_activate().


# 207728 06-May-2010 alc

Eliminate page queues locking around most calls to vm_page_free().


# 207669 05-May-2010 alc

Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held. (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements. It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib


# 207662 05-May-2010 trasz

Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().

Reviewed by: kib


# 207584 03-May-2010 kib

Lock the page around vm_page_activate() and vm_page_deactivate() calls
where it was missed. The wrapped fragments now protect wire_count with
page lock.

Reviewed by: alc


# 203072 27-Jan-2010 rmacklem

Fix a race that can occur when nfs nfsiod threads are being created.
Without this patch it was possible for a different thread that calls
nfs_asyncio() to snitch a newly created nfsiod thread that was
intended for another caller of nfs_asyncio(), because the nfs_iod_mtx
mutex was unlocked while the new nfsiod thread was created. This patch
labels the newly created nfsiod, so that it is not taken by another
caller of nfs_asyncio(). This is believed to fix the problem reported
on the freebsd-stable email list under the subject:
FreeBSD NFS client/Linux NFS server issue.

Tested by: to DOT my DOT trociny AT gmail DOT com
Reviewed by: jhb
MFC after: 2 weeks


# 195703 14-Jul-2009 kib

Use PBDRY flag for msleep(9) in NFS and NLM when sleeping thread owns
kernel resources that block other threads, like vnode locks. The SIGSTOP
sent to such thread (process, rather) shall not stop it until thread
releases the resources.

Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)


# 195203 30-Jun-2009 dfr

Adjust the internal NFS KPI to avoid the last traces of NFS_LEGACYRPC.

Approved by: re


# 195202 30-Jun-2009 dfr

Remove the old kernel RPC implementation and the NFS_LEGACYRPC option.

Approved by: re


# 194425 18-Jun-2009 alc

Fix some of the style errors in *getpages().


# 193952 10-Jun-2009 rmacklem

Add a test for VI_DOOMED just after nfs_upgrade_vnlock() in
nfs_bioread_check_cons(). This is required since it is possible
for the vnode to be vgonel()'d while in nfs_upgrade_vnlock() when
a forced dismount is in progress. Also, move the check for VI_DOOMED
in nfs_vinvalbuf() down to after nfs_upgrade_vnlock() and replace the
out of date comment for it.

Submitted by: jhb
Tested by: pho
Approved by: kib (mentor)
MFC after: 1 month


# 193187 31-May-2009 alc

nfs_write() can use the recently introduced vfs_bio_set_valid() instead of
vfs_bio_set_validclean(), thereby avoiding the page queues lock.

Garbage collect vfs_bio_set_validclean(). Nothing uses it any longer.


# 192986 28-May-2009 alc

Make *getpages()s' assertion on the state of each page's dirty bits
stricter.


# 192578 22-May-2009 rwatson

Remove the unmaintained University of Michigan NFSv4 client from 8.x
prior to 8.0-RELEASE. Rick Macklem's new and more feature-rich NFSv234
client and server are replacing it.

Discussed with: rmacklem


# 192134 15-May-2009 alc

Eliminate unnecessary clearing of the page's dirty mask from various
getpages functions.

Eliminate a stale comment.


# 192010 12-May-2009 alc

Eliminate gratuitous clearing of the page's dirty mask.


# 191964 10-May-2009 alc

Eliminate stale comments.

Eliminate a case of unnecessary page queues locking.


# 190380 24-Mar-2009 rwatson

Add DTrace probes to the NFS access and attribute caches. Access cache
events are:

nfsclient:accesscache:flush:done
nfsclient:accesscache:get:hit
nfsclient:accesscache:get:miss
nfsclient:accesscache:load:done

They pass the vnode, uid, and requested or loaded access mode (if any);
the load event may also report a load error if the RPC fails.

The attribute cache events are:

nfsclient:attrcache:flush:done
nfsclient:attrcache:get:hit
nfsclient:attrcache:get:miss
nfsclient:attrcache:load:done

They pass the vnode, optionally the vattr if one is present (hit or load),
and in the case of a load event, also a possible RPC error.

MFC after: 1 month
Sponsored by: Google, Inc.


# 183754 10-Oct-2008 attilio

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 182371 28-Aug-2008 attilio

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 176134 09-Feb-2008 attilio

namei() can call underlying nfs_readlink() passing a struct uio pointer
owned by a NULL owner. This will lead consequent VOP_ISLOCKED() present
into nfs_upgrade_vnlock() to panic as it only acquire curthread now.
Fix nfs_upgrade_vnlock() and nfs_downgrade_vnlock() in order to not use
more the struct thread pointer passed as argument (as it is really nomore
required there as vn_lock() and VOP_UNLOCK doesn't get the lock more).
Using curthread, in place, doesn't get ambiguity as LK_EXCLOTHER should
be handled as a "not locked" request by both functions.

Reported by: kris
Tested by: kris
Reviewed by: ups


# 172324 25-Sep-2007 mohans

Fix for a very rare race, caused by the nfsiod wakeup and nfsiod idle
timeout occurring at exactly the same time. If this happens, the nfsiod
exits although there may be a queued async IO request for it.

Found by : Kris Kennaway
Approved by: re


# 171189 03-Jul-2007 jhb

Fix up NFS client write error handling. Errors are split into
recoverable and unrecoverable. For the former, we redirty the
buffer and hang onto it for future retries. For the latter (eg.
ESTALE), we discard the buffer and return the error back to the
user on the next syscall. This fixes a number of vfs panics and
fixes having a large number of dirty buffers (that cannot be
written out and reclaimed) from hanging around. Thanks to ups@
for discussions on this issue.

Reported by: kris, Kai, others
Approved by: re (kensmith)


# 170292 04-Jun-2007 attilio

Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


# 170170 31-May-2007 attilio

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 169667 18-May-2007 jeff

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 169043 25-Apr-2007 jhb

Various fixes to the NFS Directio support.
- Fix for a bug where a close would not wait for all (directio)
dirty buffers to drain. The nfsnode was not marked NMODIFIED
when there were directio dirtied buffers pending, causing this.
- No reason to vhold/vrele the vp when enqueueing DirectIO requests
for the nfsiods. The vnode can't really go way since the close
has to wait for these requests to drain.

MFC after: 1 week
Submitted by: mohans


# 161125 09-Aug-2006 alc

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


# 158915 25-May-2006 ups

Call vm_object_page_clean() with the object lock held.

Submitted by: kensmith@
Reviewed by: mohans@
MFC after: 6 days


# 158906 24-May-2006 ups

Do not set B_NOCACHE on buffers when releasing them in flushbuflist().
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flush dirty page in the data invalidation function
of some network file systems.

This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.

Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by: tegge@
MFC after: 7 days


# 158739 18-May-2006 mohans

Changes to make the NFS client MP safe.

Thanks to Kris Kennaway for testing and sending lots of bugs my way.


# 157557 05-Apr-2006 mohans

Keep track of the number of in-progress async direct IO writes in the nfsnode.
Make fsync/close wait until all of these drain. Add a check to nfs_getpage() and
nfs_putpage().


# 152656 21-Nov-2005 ps

- Always return success from NFS strategy. nfs_doio(), in the
event of an error, does the right thing, in terms of setting
the error flags in the buf header. That fixes a crash from
bstrategy().
- Treat ETIMEDOUT as a "recoverable" error, causing the buffer
to be re-dirtied. ETIMEDOUT can occur on soft mounts, when
the number of retries are exceeded, and we don't want data loss
in that case.

Submitted by: Mohan Srinivasan


# 148268 21-Jul-2005 ps

Remove the NFS client rslock. The rslock was used to serialize
writers that want to extend the file. It was also used to serialize
readers that might want to read the last block of the file (with a
writer extending the file). Now that we support vnode locking for
NFS, the rslock is unnecessary. Writers grab the exclusive vnode
lock before writing and readers grab the shared (or in some cases
the exclusive) lock.

Submitted by: Mohan Srinivasan


# 147420 16-Jun-2005 green

Ifdef out the incomplete non-blocking IO implementation for NFS
pending discussion of how implementation would proceed. Applications
like -lc_r expect select(3) to match the EAGAIN-status of IO
functions.

Approved by: re


# 147280 10-Jun-2005 green

Fix a serious deadlock with the NFS client. Given a large enough
atomic write request, it can fill the buffer cache with the entirety
of that write in order to handle retries. However, it never drops
the vnode lock, or else it wouldn't be atomic, so it ends up waiting
indefinitely for more buf memory that cannot be gotten as it has it
all, and it waits in an uncancellable state.

To fix this, hibufspace is exported and scaled to a reasonable
fraction. This is used as the limit of how much of an atomic write
request by the NFS client will be handled asynchronously. If the
request is larger than this, it will be turned into a synchronous
request which won't deadlock the system. It's possible this value is
far off from what is required by some, so it shall be tunable as soon
as mount_nfs(8) learns of the new field.

The slowdown between an asynchronous and a synchronous write on NFS
appears to be on the order of 2x-4x.

General nod by: gad
MFC after: 2 weeks
More testing: wes
PR: kern/79208


# 143822 18-Mar-2005 das

Don't brelse(bp) if bp is null. Also, eliminate some redundancy
and dead code.

Found by: Coverity Prevent analysis tool


# 143510 13-Mar-2005 jeff

- The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem. Check that rather than VI_XLOCK.

Sponsored by: Isilon Systems, Inc.


# 140731 24-Jan-2005 phk

Remove unused cred arg from nfs_vinvalbuf() and many bogus arguments
passed for it.


# 140220 14-Jan-2005 phk

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


# 139823 06-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 138899 15-Dec-2004 ps

First cut of NFS direct IO support.
- NFS direct IO completely bypasses the buffer and page caches.
If a file is open for direct IO all caching is disabled.
- Direct IO for Directories will be addressed later.
- 2 new NFS directio related sysctls are added. One is a knob to
disable NFS direct IO completely (direct IO is enabled by default).
The other is to disallow mmaped IO on a file that has at least one
O_DIRECT open (see the comment in nfs_vnops.c for more details).
The default is to allow mmaps on a file that has O_DIRECT opens.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from: Yahoo!


# 138644 10-Dec-2004 ps

Store a hint in the nfsnode to detect sequential access of the file.
Kick off a readahead only when sequential access is detected. This
eliminates wasteful readaheads in random file access.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from: Yahoo!


# 138496 06-Dec-2004 ps

Rewrite of the NFS client's reply handling. We now have NFS socket
upcalls which do RPC header parsing and match up the reply with the
request. NFS calls now sleep on the nfsreq structure. This enables
us to eliminate the NFS recvlock.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


# 138473 06-Dec-2004 ps

2 fixes that improve on the consistency of the NFS client cache.
- Change the cached mtime to a 'struct timespec' from a
time_t. Improving the precision of the cached mtime tightens up
NFS' "close-to-open" consistency considerably.
- Always force an over-the-wire consistency check from nfs_open()
(unless the file is marked modified). This further improves
NFS' "close-to-open" consistency.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


# 138469 06-Dec-2004 ps

Serialize NFS vinvalbuf operations by acquiring/upgrading to the
vnode EXCLUSIVE lock. This prevents threads from adding pages to
the vnode while an invalidation is in progress, closing potential
races. In the bioread() path, callers acquire the SHARED vnode lock
- so while an invalidate was in progress, it was possible to fault
in new pages onto the vnode causing the invalidation to take a while
or fail. We saw these races at Yahoo! with very large files+heavy
concurrent access. Forcing an upgrade to EXCLUSIVE lock before doing
the invalidation closes all these races.

Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com


# 137846 18-Nov-2004 jeff

- Eliminate the acquisition and release of the bqlock in bremfree() by
setting the B_REMFREE flag in the buf. This is done to prevent lock order
reversals with code that must call bremfree() with a local lock held.
This also reduces overhead by removing two lock operations per buf for
fsync() and similar.
- Check for the B_REMFREE flag in brelse() and bqrelse() after the bqlock
has been acquired so that we may remove ourself from the free-list.
- Provide a bremfreef() function to immediately remove a buf from a
free-list for use only by NFS. This is done because the nfsclient code
overloads the b_freelist queue for its own async. io queue.
- Simplify the numfreebuffers accounting by removing a switch statement
that executed the same code in every possible case.
- getnewbuf() can encounter locked bufs on free-lists once Giant is removed.
Remove a panic associated with this condition and delay asserts that
inspect the buf until after it is locked.

Reviewed by: phk
Sponsored by: Isilon Systems, Inc.


# 137197 04-Nov-2004 phk

Retire b_magic now, we have the bufobj containing the same hint.


# 136927 24-Oct-2004 phk

Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().


# 136767 22-Oct-2004 phk

Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.


# 136006 01-Oct-2004 das

nfsclient/nfs_bio.c has a PHOLD() without a PRELE(). Neither should
be necessary here. Also, use killproc() instead of psignal().


# 135280 15-Sep-2004 phk

Remove unused B_WRITEINPROG flag


# 134898 07-Sep-2004 phk

Explicitly pass vnode to nfs_doio() and mountpoint to nfs_asyncio().


# 131691 06-Jul-2004 alfred

NFS mobility PHASE I, II & III (phase VI, and V pending):

Rebind the client socket when we experience a timeout. This fixes
the case where our IP changes for some reason.

Signal a VFS event when NFS transitions from up to down and vice
versa.

Add a placeholder vfs_sysctl where we will put status reporting
shortly.

Also:
Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.


# 130619 16-Jun-2004 rwatson

Remove bad cookie vp kernel printf; while it does notify about an
interesting event, there's little or nothing the user can do about
it.


# 128992 06-May-2004 alc

Make vm_page's PG_ZERO flag immutable between the time of the page's
allocation and deallocation. This flag's principal use is shortly after
allocation. For such cases, clearing the flag is pointless. The only
unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never
requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed
page.

Reviewed by: tegge@


# 128263 14-Apr-2004 peadar

Let the NFS client notice a file's size changing as a modification.
This avoids presenting invalid data to the client's applications
when the file is modified, and then extended within the window of
the resolution of the modifcation timestamp.

Reviewed By: iedowse
PR: kern/64091


# 127977 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson


# 126853 11-Mar-2004 phk

Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().


# 125454 04-Feb-2004 jhb

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 122953 22-Nov-2003 alfred

Use function pointers to remove the depenancy cross dependancy on nfs4
and the nfs3 client. Also fix some bugs that happen to be causing crashes
in both v3 and v4 introduced by the v4 import.

Submitted by: Jim Rees <rees@umich.edu>
Approved by: re


# 122698 14-Nov-2003 alfred

University of Michigan's Citi NFSv4 kernel client code.

Submitted by: Jim Rees <rees@umich.edu>


# 121201 18-Oct-2003 phk

Initialize bp->b_offset before calling VOP_STRATEGY().

Remove KASSERTS and panics with B_PHYS checks which no longer apply.


# 121191 18-Oct-2003 phk

We do not get B_PHYS buffers here anymore. /dev/drum is long gone.


# 120730 04-Oct-2003 jeff

- Remove the backtrace() call from the *_vinvalbuf() functions. Thanks to a
stack trace supplied by phk, I now understand what's going on here. The
check for VI_XLOCK stops us from calling vinvalbuf once the vnode has been
partially torn down in vclean(). It is not clear that this would cause
a problem. Document this in nfs_bio.c, which is where the other two
filesystems copied this code from.


# 120264 19-Sep-2003 jeff

- Remove interlock protection around VI_XLOCK. The interlock is not
sufficient to guarantee that this race is not hit. The XLOCK will likely
have to be redesigned due to the way reference counting and mutexes work
in FreeBSD. We currently can not be guaranteed that xlock was not set
and cleared while we were blocked on the interlock while waiting to check
for XLOCK. This would lead us to reference a vnode which was not the
vnode we requested.
- Add a backtrace() call inside of INVARIANTS in the hopes of finding out if
this condition is ever hit. It should not, since we should be retaining
a reference to the vnode in these cases. The reference would be sufficient
to block recycling.


# 116461 17-Jun-2003 alc

Lock the vm object when freeing a page.


# 115456 31-May-2003 phk

The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent
deadlocks with vnode backed md(4) devices because md now uses a
kthread to run the bio requests instead of doing it directly from
the bio down path.


# 115041 15-May-2003 rwatson

This change grabs the vnode lock for NFS client vnodes when calling
VOP_SETATTR() or VOP_GETATTR(); without these locks (a) VFS_DEBUG_LOCKS
will panic, and (b) it may be possible to corrupt entries in the cached
vnode attributes in the nfsnode, since nfsnode attribute cache data is
also protected by the vnode lock.

Approved by: re (jhb)
Pointed out by: VFS_DEBUG_LOCKS


# 111856 03-Mar-2003 jeff

- Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.

Reviwed by: arch
Not objected to by: mckusick


# 111748 02-Mar-2003 des

More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).


# 108357 28-Dec-2002 dillon

Abstract-out the constants for the sequential heuristic.

No operational changes.

MFC after: 1 day


# 103939 25-Sep-2002 jeff

- Lock access to the buf lists.
- Use vrefcnt() where appropriate.
- Add some locking asserts.


# 101308 04-Aug-2002 jeff

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 100450 21-Jul-2002 alc

o Lock page queue accesses in nfs_getpages().


# 100194 16-Jul-2002 dillon

Fix a bug nfs_write() related to ^C'ing during a file write on an
interruptable mount. We were returning from inside the loop without
releasing the rslock.

Submitted by: Mike Junk <junk@isilon.com>
MFC after: 3 days


# 99797 11-Jul-2002 dillon

Convert old style (type foo *)0 casts to NULLs

PR: kern/40360
Requested by: Hiten PAndya via direct email


# 99737 10-Jul-2002 dillon

Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on. Extensive testing has appeared to have shown no
increase in overhead.

Disadvantages
Dirties more cache lines during lookups.

Not as fast as a hash table lookup (but still N log N and optimal
when there is locality of reference).

Advantages
vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
syncer operate more efficiently.

I get to rip out all the old hacks (some of which were mine) that tried
to keep the v_dirtyblkhd tailq sorted.

The per-vnode splay tree should be easier to lock / SMPng pushdown on
vnodes will be easier.

This commit along with another that Alan is working on for the VM page
global hash table will allow me to implement ranged fsync(), optimize
server-side nfs commit rpcs, and implement partial syncs by the
filesystem syncer (aka filesystem syncer would detect that someone is
trying to get the vnode lock, remembers its place, and skip to the
next vnode).

Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc


# 98988 28-Jun-2002 jhb

In namei(), we use a NULL thread for uio_td when doing a VOP_READLINK().
nfs_readlink() calls nfs_bioread() which passes in uio_td as the thread
argument to nfs_getcacheblk(). In nfs_getcacheblk() we dereference the
thread pointer to get a process pointer to pass to nfs_sigintr(). This
obviously results in a panic. :)

Rather than change nfs_getcacheblk() to check if the thread pointer is
NULL when calling nfs_sigintr() like other callers do, change
nfs_sigintr() to take a thread as the last argument instead of a
process so none of the callers have to care if the thread is NULL or not.


# 91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 89407 15-Jan-2002 peter

Revise the nfsiod auto tuning code. Now both the upper and lower limits
are specifyable by sysctl and are respected.

Submitted by: Maxime Henrion <mux@sneakerz.org>


# 89324 14-Jan-2002 peter

Implement vfs.nfs.iodmin (minimum number of nfsiod's) and
vfs.nfs.iodmaxidle (idle time before nfsiod's exit). Make it adaptive
so that we create nfsiod's on demand and they go away after not being
used for a while. The upper limit is NFS_MAXASYNCDAEMON (currently 20).
More will be done here, but this is a useful checkpoint.

Submitted by: Maxime Henrion <mux@qualys.com>


# 87834 13-Dec-2001 dillon

This fixes a large number of bugs in our NFS client side code. A recent
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.

* An edge case with shared R+W mmap()'s and truncate whereby
the system would inappropriately clear the dirty bits on
still-dirty data. (applicable to all filesystems)

THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
see vm/vm_page.c line 1641

* The straddle case for VM pages and buffer cache buffers when
truncating. (applicable to NFS client side)

* Possible SMP database corruption due to vm_pager_unmap_page()
not clearing the TLB for the other cpu's. (applicable to NFS
client side but could effect all filesystems). Note: not
considered serious since the corruption occurs beyond the file
EOF.

* When flusing a dirty buffer due to B_CACHE getting cleared,
we were accidently setting B_CACHE again (that is, bwrite() sets
B_CACHE), when we really want it to stay clear after the write
is complete. This resulted in a corrupt buffer. (applicable
to all filesystems but probably only triggered by NFS)

* We have to call vtruncbuf() when ftruncate()ing to remove
any buffer cache buffers. This is still tentitive, I may
be able to remove it due to the second bug fix. (applicable
to NFS client side)

* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
to set n_size before calling nfs_vinvalbuf or the NFS code
may recursively vnode_pager_setsize() to the original value
before the truncate. This is what was causing the user mmap
bus faults in the nfs tester program. (applicable to NFS
client side)

* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
by Kirk).

Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step')
MFC after: 1 week


# 86089 05-Nov-2001 dillon

Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write. This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use. The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after: 1 week


# 84827 11-Oct-2001 jhb

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.


# 83654 18-Sep-2001 peter

Sigh, Last minute pre-merge typo. (missing quotes)


# 83651 18-Sep-2001 peter

Cleanup and split of nfs client and server code.
This builds on the top of several repo-copies.


# 83629 18-Sep-2001 imp

nfs_strategy calls nfs_asyncio with td as NULL. So add a bandaid that
will pass NULL as the struct proc when td is NULL. This has stopped
crashing on my machine.

Note: The passing of NULL may be bogus, but I'll let others fix that
problem.

Reviewed by: jhb


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 79247 04-Jul-2001 jhb

- Sort includes.
- Update vmmeter statistics for vnode pagein/pageouts in getpages/putpages.


# 79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 77086 23-May-2001 jhb

Assert Giant is held by the caller rather than getting it and releasing
it in getpages/putpages.


# 76827 18-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 76117 29-Apr-2001 grog

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# 75858 23-Apr-2001 grog

Correct #includes to work with fixed sys/mount.h.


# 75692 19-Apr-2001 alfred

vnode_pager_freepage() is really vm_page_free() in disguise,
nuke vnode_pager_freepage() and replace all calls to it with vm_page_free()


# 75580 17-Apr-2001 phk

This patch removes the VOP_BWRITE() vector.

VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.


# 73929 07-Mar-2001 jhb

Grab the process lock while calling psignal and before calling psignal.


# 60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 59249 15-Apr-2000 phk

Complete the bio/buf divorce for all code below devfs::strategy

Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.

CCD not converted yet, casts to struct buf (still safe)

atapi-cd casts to struct buf to examine B_PHYS


# 58934 02-Apr-2000 phk

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


# 58349 20-Mar-2000 phk

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 58345 20-Mar-2000 phk

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.


# 55431 05-Jan-2000 dillon

Enhance reassignbuf(). When a buffer cannot be time-optimally inserted
into vnode dirtyblkhd we append it to the list instead of prepend it to
the list in order to maintain a 'forward' locality of reference, which
is arguably better then 'reverse'. The original algorithm did things this
way to but at a huge time cost.

Enhance the append interlock for NFS writes to handle intr/soft mounts
better.

Fix the hysteresis for NFS async daemon I/O requests to reduce the
number of unnecessary context switches.

Modify handling of NFS mount options. Any given user option that is
too high now defaults to the kernel maximum for that option rather then
the kernel default for that option.

Reviewed by: Alfred Perlstein <bright@wintelcom.net>


# 54605 14-Dec-1999 dillon

Fix two problems: First, fix the append seek position race that can
occur due to np->n_size potentially changing if nfs_getcacheblk()
blocks in nfs_write().

Second, under -current we must supply the proper bufsize when obtaining
buffers that straddle the EOF, but due to the fact that np->n_size can
change out from under us it is possible that we may specify the wrong
buffer size and wind up truncating dirty data written by another
process.

Both problems are solved by implementing nfs_rslock(), which allows us
to lock around sensitive buffer cache operations such as those that
occur when appending to a file.

It is believed that this race is responsible for causing dirtyoff/dirtyend
and (in stable) validoff/validend to exceed the buffer size. Therefore
we have now added a warning printf for the dirtyoff/end case in current.

However, we have introduced a new problem which we need to fix at some
point, and that is that soft or intr NFS mounts may become
uninterruptable from the point of view of process A which is stuck waiting
on rslock while process B is stuck doing the rpc. To unstick process A,
process B would have to be interrupted first.

Reviewed by: Alfred Perlstein <bright@wintelcom.net>


# 54480 12-Dec-1999 dillon

Synopsis of problem being fixed: Dan Nelson originally reported that
blocks of zeros could wind up in a file written to over NFS by a client.
The problem only occurs a few times per several gigabytes of data. This
problem turned out to be bug #3 below.

bug #1:

B_CLUSTEROK must be cleared when an NFS buffer is reverted from
stage 2 (ready for commit rpc) to stage 1 (ready for write).
Reversions can occur when a dirty NFS buffer is redirtied with new
data.

Otherwise the VFS/BIO system may end up thinking that a stage 1
NFS buffer is clusterable. Stage 1 NFS buffers are not clusterable.

bug #2:

B_CLUSTEROK was inappropriately set for a 'short' NFS buffer (short
buffers only occur near the EOF of the file). Change to only set
when the buffer is a full biosize (usually 8K). This bug has no
effect but should be fixed in -current anyway. It need not be
backported.

bug #3:

B_NEEDCOMMIT was inappropriately set in nfs_flush() (which is
typically only called by the update daemon). nfs_flush()
does a multi-pass loop but due to the lack of vnode locking it
is possible for new buffers to be added to the dirtyblkhd list
while a flush operation is going on. This may result in nfs_flush()
setting B_NEEDCOMMIT on a buffer which has *NOT* yet gone through its
stage 1 write, causing only the commit rpc to be made and thus
causing the contents of the buffer to be thrown away (never sent to
the server).

The patch also contains some cleanup, which only applies to the commit
into -current.

Reviewed by: dg, julian
Originally Reported by: Dan Nelson <dnelson@emsphone.com>


# 52635 29-Oct-1999 phk

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 51475 20-Sep-1999 dillon

Add comment to clarify a commit rpc optimization already being performed.


# 51344 17-Sep-1999 dillon

Asynchronized client-side nfs_commit. NFS commit operations were
previously issued synchronously even if async daemons (nfsiod's) were
available. The commit has been moved from the strategy code to the doio
code in order to asynchronize it.

Removed use of lastr in preparation for removal of vnode->v_lastr. It
has been replaced with seqcount, which is already supported by the system
and, in fact, gives us a better heuristic for sequential detection then
lastr ever did.

Made major performance improvements to the server side commit. The
server previously fsync'd the entire file for each commit rpc. The
server now bawrite()s only those buffers related to the offset/size
specified in the commit rpc.

Note that we do not commit the meta-data yet. This works still needs
to be done.

Note that a further optimization can be done (and has not yet been done)
on the client: we can merge multiple potential commit rpc's into a
single rpc with a greater file offset/size range and greatly reduce
rpc traffic.

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 49945 17-Aug-1999 alc

Add the (inline) function vm_page_undirty for clearing the dirty bitmask
of a vm_page.

Use it.

Submitted by: dillon


# 49659 12-Aug-1999 dt

nfs_getcacheblk() can return 0 if the mount is interruptible. It need to be
checked by the caller.

Broken in: rev. 1.70 (1999/05/02)


# 48225 26-Jun-1999 mckusick

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# 47964 16-Jun-1999 mckusick

Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.


# 47749 05-Jun-1999 peter

Don't mistake a non-async block that needs to be committed for an
interrupted write.

Obtained from: fvdl@NetBSD.org via OpenBSD.


# 46580 06-May-1999 phk

remove b_proc from struct buf, it's (now) unused.

Reviewed by: dillon, bde


# 46349 02-May-1999 alc

The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 45361 06-Apr-1999 peter

Hold nfsd's upages in-core with PHOLD rather than P_NOSWAP.


# 45347 05-Apr-1999 julian

Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>


# 44679 12-Mar-1999 julian

Reviewed by: Many at differnt times in differnt parts,
including alan, john, me, luoqi, and kirk
Submitted by: Matt Dillon <dillon@frebsd.org>

This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existance: figuring
out when getnewbuf() needs to sleep.

The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)


# 42957 21-Jan-1999 dillon

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 41791 14-Dec-1998 dt

(Hopefully) fix support for "large" files. Mostly cast block numbers to off_t
before they multiplied to block sizes.


# 41591 07-Dec-1998 archie

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


# 41026 09-Nov-1998 peter

Remove [apparently] bogus casts to u_long for the vnode_pager_setsize()
second argument. np_size is a 64 bit int, so is the second arg. This
might have caused needless 2G/4G file size problems.

I believe it was Bruce who queried this.


# 39782 29-Sep-1998 mckusick

Mark directory buffers that have no valid data with B_INVAL
so that they are not put in the cache.


# 39781 29-Sep-1998 mckusick

When adding data to a buffer, we need to clear the B_NEEDCOMMIT flag
which says that the data is on server but not committed.


# 38799 04-Sep-1998 dfr

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# 36979 14-Jun-1998 bde

Avoid an egcs pessimization for 64-bit signed division on i386's.
Pre-2.8 versions of gcc generate a call to __divdi3() for all 64-bit
signed divisions, but egcs optimizes them to a shift and fixup when
the divisor is a constant power of 2. Unfortunately, it generates
a call to __cmpdi2() for the fixup, although all except possibly
ancient versions of gcc and egcs do ordinary 64-bit comparisons
inline.


# 36563 01-Jun-1998 peter

Make sure we go a nfs_fsinfo() in get/putpages before calling
readrpc/writerpc, since they assume it's already been done. This could
break if the first read/write access to a nfs filesystem was an exec() or
mmap() instead of a read(), write() syscall. (or statfs()).
nfs_getpages() could return an errno (EOPNOTSUPP) instead of a VM_PAGER_*
return code. Some layout tweaks for the get/putpages code.


# 36473 30-May-1998 peter

When using NFSv3, use the remote server's idea of the maximum file size
rather than assuming 2^64. It may not like files that big. :-)
On the nfs server, calculate and report the max file size as the point
that the block numbers in the cache would turn negative.
(ie: 1099511627775 bytes (1TB)).

One of the things I'm worried about however, is that directory offsets
are really cookies on a NFSv3 server and can be rather large, especially
when/if the server generates the opaque directory cookies by using a local
filesystem offset in what comes out as the upper 32 bits of the 64 bit
cookie. (a server is free to do this, it could save byte swapping
depending on the native 64 bit byte order)

Obtained from: NetBSD


# 36248 20-May-1998 peter

A cleaner fix for PR#5102, clear nonsense flags at mount time rather than
in the core of nfs_bio.c at the 11th hour.

PR: 5102


# 36176 19-May-1998 peter

Allow control of the attribute cache timeouts at mount time.

We had run out of bits in the nfs mount flags, I have moved the internal
state flags into a seperate variable. These are no longer visible via
statfs(), but I don't know of anything that looks at them.


# 34930 28-Mar-1998 steve

Don't allow the readdirplus routine to be used in NFS V2.

PR: 5102
Reviewed by: msmith
Submitted by: Dmitry Kohmanyuk <dk@farm.org>


# 34266 08-Mar-1998 julian

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 34206 07-Mar-1998 dyson

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 34096 06-Mar-1998 msmith

Trivial filesystem getpages/putpages implementations, set the second.
These should be considered the first steps in a work-in-progress.
Submitted by: Terry Lambert <terry@freebsd.org>


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 32912 30-Jan-1998 tegge

Release the buffer when an error occurs while reading directory entries.


# 32755 25-Jan-1998 dyson

Various NFS fixes:
Make vfs_bio buffer mgmt work better.
Buffers were being used after brelse.
Make nfs_getpages work independently of other NFS
interfaces. This eliminates some difficult
recursion problems and decreases pagefault
overhead.
Remove an erroneous vfs_unbusy_pages.
Fix a reentrancy problem, with nfs_vinvalbuf when
vnode is already being rundown.
Reassignbuf wasn't being called when needed under
certain circumstances.

(Thanks to Bill Paul for help.)


# 32286 06-Jan-1998 dyson

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 31617 07-Dec-1997 dyson

Various of the ISP users have commented that the 1.41 version of the
nfs_bio.c code worked better than the 1.44. This commit reverts
the important parts of 1.44 to 1.41, and we will fix it when we
can get a handle on the problem.


# 29288 10-Sep-1997 phk

unifdef -U__NetBSD__ -D__FreeBSD__


# 27845 02-Aug-1997 bde

Removed unused #includes.


# 26929 25-Jun-1997 dfr

Avoid small synchronous writes when an application does lots of random-access
short writes within a block (e.g. ld).


# 26669 15-Jun-1997 dyson

Upgrade NFS to support the new vfs_bio resource/buffer management.


# 26469 06-Jun-1997 dfr

Fix a problem caused by removing large numbers of files from a directory
which could cause a bad size to be given to uiomove, causing a page fault.


# 26409 03-Jun-1997 dfr

Fix some performance problems with the NFS mmap fixes.


# 25930 19-May-1997 dfr

Fix a few bugs with NFS and mmap caused by NFS' use of b_validoff
and b_validend. The changes to vfs_bio.c are a bit ugly but hopefully
can be tidied up later by a slight redesign.

PR: kern/2573, kern/2754, kern/3046 (possibly)
Reviewed by: dyson


# 25785 13-May-1997 dfr

Check the B_CLUSTER flag when choosing whether to use unstable or filesync
writes.

PR: kern/3438
Submitted by: Tor Egge <Tor.Egge@idi.ntnu.no>


# 25023 19-Apr-1997 dfr

Fix a bug where a program which appended many small records to a file could
wind up writing zeros instead of real data when the file is on an NFSv2
mounted directory.

While tracking this bug down, I noticed that nfs_asyncio was waking *all*
the iods when a block was written instead of just one per block. Fixing this
gives a 25% performance improvment for writes on v2 (less for v3).

Both are 2.2 candidates.

PR: kern/2774


# 25003 18-Apr-1997 dfr

Don't allow partial buffers to be cluster-comitted.
Zero the b_dirty{off,end} after cluster-comitting a group of buffers.

With these fixes, I was able to complete a 'make world' with remote src
and obj directories.


# 24577 03-Apr-1997 dfr

The code which recovered from a modified directory situation did not check
for eof when re-caching the directory. This could cause it to loop forever
if a directory was truncated.


# 23570 09-Mar-1997 bde

YAMInTheWrongDirectionF22 (part of rev.1.28.2.3: set B_CLUSTEROK for
commits).


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 19449 06-Nov-1996 dfr

Improve the queuing algorithms used by NFS' asynchronous i/o. The
existing mechanism uses a global queue for some buffers and the
vp->b_dirtyblkhd queue for others. This turns sequential writes into
randomly ordered writes to the server, affecting both read and write
performance. The existing mechanism also copes badly with hung
servers, tending to block accesses to other servers when all the iods
are waiting for a hung server.

The new mechanism uses a queue for each mount point. All asynchronous
i/o goes through this queue which preserves the ordering of requests.
A simple mechanism ensures that the iods are shared out fairly between
active mount points. This removes the sysctl variable vfs.nfs.dwrite
since the new queueing mechanism removes the old delayed write code
completely.

This should go into the 2.2 branch.


# 19070 21-Oct-1996 dfr

If a large (>4096 bytes) directory was modified, the old directory
contents are discarded, including the cached seek cookies.
Unfortunately, if the directory was larger than NFS_DIRBLKSIZ, then
this confused nfs_readdirrpc(), making it appear as if the directory
was truncated.

Reviewed by: Karl Denninger <karl@Mcs.Net>


# 18888 12-Oct-1996 bde

Staticized `nfs_dwrite'.


# 18866 11-Oct-1996 dfr

This fixes a problem with the nfs socket handling code which happens
if a single process is performing a large number of requests (in this
case writing a large file). The writing process could monopolise the
recieve lock and prevent any other processes from recieving their
replies.

It also adds a new sysctl variable 'vfs.nfs.dwrite' which controls the
behaviour which originally pointed out the problem. When a process
writes to a file over NFS, it usually arranges for another process
(the 'iod') to perform the request. If no iods are available, then it
turns the write into a 'delayed write' which is later picked up by the
next iod to do a write request for that file. This can cause that
particular iod to do a disproportionate number of requests from a
single process which can harm performance on some NFS servers. The
alternative is to perform the write synchronously in the context of
the original writing process if no iod is avaiable for asynchronous
writing.

The 'delayed write' behaviour is selected when vfs.nfs.dwrite=1 and
the non-delayed behaviour is selected when vfs.nfs.dwrite=0. The
default is vfs.nfs.dwrite=1; if many people tell me that performance
is better if vfs.nfs.dwrite=0 then I will change the default.

Submitted by: Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>


# 18397 19-Sep-1996 nate

In sys/time.h, struct timespec is defined as:

/*
* Structure defined by POSIX.4 to be like a timeval.
*/
struct timespec {
time_t ts_sec; /* seconds */
long ts_nsec; /* and nanoseconds */
};

The correct names of the fields are tv_sec and tv_nsec.

Reminded by: James Drobina <jdrobina@infinet.com>


# 17186 16-Jul-1996 dfr

Various fixes from frank@fwi.uva.nl (Frank van der Linden) via
rick@snowhite.cis.uoguelph.ca:

1. Clear B_NEEDCOMMIT in nfs_write to make sure that dirty data is
correctly send to the server. If a buffer was dirtied when it was in
the B_DELWRI+B_NEEDCOMMIT state, the state of the buffer was left
unchanged and when the buffer was later cleaned, just a commit rpc was
made to the server to complete the previous write. Clearing
B_NEEDCOMMIT ensures that another write is made to the server.

2. If a server returned a server (for whatever reason) returned an
answer to a write RPC that implied that fewer bytes than requested
were written, bad things would happen.

3. The setattr operation passed on the atime in stead of the mtime to
the server. The fix is trivial.

4. XIDs always started at 0, but this caused some servers (older DEC
OSF/1 3.0 so I've been told) who had very long-lasting XID caches to
get confused if, after a reboot of a BSD client, RPCs came in with a
XID that had in the past been used before from that client. Patch is
to use the current time in seconds as a starting point for XIDs. The
patch below is not perfect, because it requires the root fs to be
mounted first. This is because of the check BSD systems do, comparing
FS time to system time.

Reviewed by: Bruce Evans, Terry Lambert.
Obtained from: frank@fwi.uva.nl (Frank van der Linden) via rick@snowhite.cis.uoguelph.ca


# 16192 08-Jun-1996 pst

Clear flags before using an inactive buffer. This is a kludge, but
matches the code in bread().

Reviewed by: bde


# 13612 24-Jan-1996 mpp

Add a check to prevent a computation from underflowing and causing
a panic due to an attaempt to allocate a buffer for a terabyte or
so of data when an attempt is made to create sparse data (e.g.
a holey file) more than 1 block past the end of the file.

Note: some other areas of this code need to be looked at,
since they might cause problems when the file size exceeds 2GB,
due to storing results in ints when the computations are being
done with quad sized variables.

Reviewed by: bde


# 12911 17-Dec-1995 phk

Staticize.


# 12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


# 12588 03-Dec-1995 bde

Completed function declarations and/or added prototypes and/or moved
prototypes to the right place.


# 11921 29-Oct-1995 phk

Second batch of cleanup changes.
This time mostly making a lot of things static and some unused
variables here and there.


# 10219 24-Aug-1995 dfr

Add support for amd direct maps.

Reviewed by: Thomas Graichen <graichen@sirius.physik.fu-berlin.de>


# 9428 07-Jul-1995 dfr

Use a consistent blocksize for sizing bufs to avoid panicing the bio system.


# 9336 27-Jun-1995 dfr

Changes to support version 3 of the NFS protocol.
The version 2 support has been tested (client+server) against FreeBSD-2.0,
IRIX 5.3 and FreeBSD-current (using a loopback mount). The version 2 support
is stable AFAIK.
The version 3 support has been tested with a loopback mount and minimally
against an IRIX 5.3 server. It needs more testing and may have problems.
I have patched amd to support the new variable length filehandles although
it will still only use version 2 of the protocol.

Before booting a kernel with these changes, nfs clients will need to at least
build and install /usr/sbin/mount_nfs. Servers will need to build and
install /usr/sbin/mountd.

NFS diskless support is untested.

Obtained from: Rick Macklem <rick@snowhite.cis.uoguelph.ca>


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8692 21-May-1995 dg

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


# 7871 16-Apr-1995 dg

Various fixes from John Dyson:

1) Rewrote screwy code that uses an incore buffer without making it busy.
2) Use B_CACHE instead of B_DONE in cases where it is appropriate.
3) Minor code optimization.

This *might* fix kern/345 submitted by Heikki Suonsivu.


# 6875 04-Mar-1995 dg

Removed obsolete vtrace() remnants.


# 6148 03-Feb-1995 dg

Removed a pile of vfs_unbusy_pages()...both unnecessary and wrong - resulted
in serious system instability. Changed a B_INVAL to a B_NOCACHE so that
buffer data is properly disposed of.

Submitted by: John Dyson, Rick Macklin, and ohki@gssm.otsuka.tsukuba.ac.jp


# 5471 10-Jan-1995 dg

Added two missing brelse() calls.

Submitted by: rick@snowhite.cis.uoguelph.ca


# 5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 3664 17-Oct-1994 phk

This is a bunch of changes from NetBSD. There are a couple of bug-fixes.
But mostly it is changes to use the list-maintenance macros instead of
doing the pointer-gymnastics by hand.

Obtained from: NetBSD


# 3305 02-Oct-1994 phk

Prototyping and general gcc-shutting up. Gcc has one warning now which looks
bad, I will get to it eventually, unless somebody beats me to it.


# 2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 1937 08-Aug-1994 dg

Changed B_AGE policy to work correctly in a world with relatively large
buffer caches. The old policy generally ended up caching nothing.


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources