#
292348 |
|
16-Dec-2015 |
ken |
MFC r291716, r291724, r291741, r291742
In addition to those revisions, add this change to a file that is not in head:
sys/ia64/include/bus.h: Guard kernel-only parts of the ia64 machine/bus.h header with #ifdef _KERNEL.
This allows userland programs to include <machine/bus.h> to get the definition of bus_addr_t and bus_size_t.
------------------------------------------------------------------------ r291716 | ken | 2015-12-03 15:54:55 -0500 (Thu, 03 Dec 2015) | 257 lines
Add asynchronous command support to the pass(4) driver, and the new camdd(8) utility.
CCBs may be queued to the driver via the new CAMIOQUEUE ioctl, and completed CCBs may be retrieved via the CAMIOGET ioctl. User processes can use poll(2) or kevent(2) to get notification when I/O has completed.
While the existing CAMIOCOMMAND blocking ioctl interface only supports user virtual data pointers in a CCB (generally only one per CCB), the new CAMIOQUEUE ioctl supports user virtual and physical address pointers, as well as user virtual and physical scatter/gather lists. This allows user applications to have more flexibility in their data handling operations.
Kernel memory for data transferred via the queued interface is allocated from the zone allocator in MAXPHYS sized chunks, and user data is copied in and out. This is likely faster than the vmapbuf()/vunmapbuf() method used by the CAMIOCOMMAND ioctl in configurations with many processors (there are more TLB shootdowns caused by the mapping/unmapping operation) but may not be as fast as running with unmapped I/O.
The new memory handling model for user requests also allows applications to send CCBs with request sizes that are larger than MAXPHYS. The pass(4) driver now limits queued requests to the I/O size listed by the SIM driver in the maxio field in the Path Inquiry (XPT_PATH_INQ) CCB.
There are some things things would be good to add:
1. Come up with a way to do unmapped I/O on multiple buffers. Currently the unmapped I/O interface operates on a struct bio, which includes only one address and length. It would be nice to be able to send an unmapped scatter/gather list down to busdma. This would allow eliminating the copy we currently do for data.
2. Add an ioctl to list currently outstanding CCBs in the various queues.
3. Add an ioctl to cancel a request, or use the XPT_ABORT CCB to do that.
4. Test physical address support. Virtual pointers and scatter gather lists have been tested, but I have not yet tested physical addresses or scatter/gather lists.
5. Investigate multiple queue support. At the moment there is one queue of commands per pass(4) device. If multiple processes open the device, they will submit I/O into the same queue and get events for the same completions. This is probably the right model for most applications, but it is something that could be changed later on.
Also, add a new utility, camdd(8) that uses the asynchronous pass(4) driver interface.
This utility is intended to be a basic data transfer/copy utility, a simple benchmark utility, and an example of how to use the asynchronous pass(4) interface.
It can copy data to and from pass(4) devices using any target queue depth, starting offset and blocksize for the input and ouptut devices. It currently only supports SCSI devices, but could be easily extended to support ATA devices.
It can also copy data to and from regular files, block devices, tape devices, pipes, stdin, and stdout. It does not support queueing multiple commands to any of those targets, since it uses the standard read(2)/write(2)/writev(2)/readv(2) system calls.
The I/O is done by two threads, one for the reader and one for the writer. The reader thread sends completed read requests to the writer thread in strictly sequential order, even if they complete out of order. That could be modified later on for random I/O patterns or slightly out of order I/O.
camdd(8) uses kqueue(2)/kevent(2) to get I/O completion events from the pass(4) driver and also to send request notifications internally.
For pass(4) devcies, camdd(8) uses a single buffer (CAM_DATA_VADDR) per CAM CCB on the reading side, and a scatter/gather list (CAM_DATA_SG) on the writing side. In addition to testing both interfaces, this makes any potential reblocking of I/O easier. No data is copied between the reader and the writer, but rather the reader's buffers are split into multiple I/O requests or combined into a single I/O request depending on the input and output blocksize.
For the file I/O path, camdd(8) also uses a single buffer (read(2), write(2), pread(2) or pwrite(2)) on reads, and a scatter/gather list (readv(2), writev(2), preadv(2), pwritev(2)) on writes.
Things that would be nice to do for camdd(8) eventually:
1. Add support for I/O pattern generation. Patterns like all zeros, all ones, LBA-based patterns, random patterns, etc. Right Now you can always use /dev/zero, /dev/random, etc.
2. Add support for a "sink" mode, so we do only reads with no writes. Right now, you can use /dev/null.
3. Add support for automatic queue depth probing, so that we can figure out the right queue depth on the input and output side for maximum throughput. At the moment it defaults to 6.
4. Add support for SATA device passthrough I/O.
5. Add support for random LBAs and/or lengths on the input and output sides.
6. Track average per-I/O latency and busy time. The busy time and latency could also feed in to the automatic queue depth determination.
sys/cam/scsi/scsi_pass.h: Define two new ioctls, CAMIOQUEUE and CAMIOGET, that queue and fetch asynchronous CAM CCBs respectively.
Although these ioctls do not have a declared argument, they both take a union ccb pointer. If we declare a size here, the ioctl code in sys/kern/sys_generic.c will malloc and free a buffer for either the CCB or the CCB pointer (depending on how it is declared). Since we have to keep a copy of the CCB (which is fairly large) anyway, having the ioctl malloc and free a CCB for each call is wasteful.
sys/cam/scsi/scsi_pass.c: Add asynchronous CCB support.
Add two new ioctls, CAMIOQUEUE and CAMIOGET.
CAMIOQUEUE adds a CCB to the incoming queue. The CCB is executed immediately (and moved to the active queue) if it is an immediate CCB, but otherwise it will be executed in passstart() when a CCB is available from the transport layer.
When CCBs are completed (because they are immediate or passdone() if they are queued), they are put on the done queue.
If we get the final close on the device before all pending I/O is complete, all active I/O is moved to the abandoned queue and we increment the peripheral reference count so that the peripheral driver instance doesn't go away before all pending I/O is done.
The new passcreatezone() function is called on the first call to the CAMIOQUEUE ioctl on a given device to allocate the UMA zones for I/O requests and S/G list buffers. This may be good to move off to a taskqueue at some point. The new passmemsetup() function allocates memory and scatter/gather lists to hold the user's data, and copies in any data that needs to be written. For virtual pointers (CAM_DATA_VADDR), the kernel buffer is malloced from the new pass(4) driver malloc bucket. For virtual scatter/gather lists (CAM_DATA_SG), buffers are allocated from a new per-pass(9) UMA zone in MAXPHYS-sized chunks. Physical pointers are passed in unchanged. We have support for up to 16 scatter/gather segments (for the user and kernel S/G lists) in the default struct pass_io_req, so requests with longer S/G lists require an extra kernel malloc.
The new passcopysglist() function copies a user scatter/gather list to a kernel scatter/gather list. The number of elements in each list may be different, but (obviously) the amount of data stored has to be identical.
The new passmemdone() function copies data out for the CAM_DATA_VADDR and CAM_DATA_SG cases.
The new passiocleanup() function restores data pointers in user CCBs and frees memory.
Add new functions to support kqueue(2)/kevent(2):
passreadfilt() tells kevent whether or not the done queue is empty.
passkqfilter() adds a knote to our list.
passreadfiltdetach() removes a knote from our list.
Add a new function, passpoll(), for poll(2)/select(2) to use.
Add devstat(9) support for the queued CCB path.
sys/cam/ata/ata_da.c: Add support for the BIO_VLIST bio type.
sys/cam/cam_ccb.h: Add a new enumeration for the xflags field in the CCB header. (This doesn't change the CCB header, just adds an enumeration to use.)
sys/cam/cam_xpt.c: Add a new function, xpt_setup_ccb_flags(), that allows specifying CCB flags.
sys/cam/cam_xpt.h: Add a prototype for xpt_setup_ccb_flags().
sys/cam/scsi/scsi_da.c: Add support for BIO_VLIST.
sys/dev/md/md.c: Add BIO_VLIST support to md(4).
sys/geom/geom_disk.c: Add BIO_VLIST support to the GEOM disk class. Re-factor the I/O size limiting code in g_disk_start() a bit.
sys/kern/subr_bus_dma.c: Change _bus_dmamap_load_vlist() to take a starting offset and length.
Add a new function, _bus_dmamap_load_pages(), that will load a list of physical pages starting at an offset.
Update _bus_dmamap_load_bio() to allow loading BIO_VLIST bios. Allow unmapped I/O to start at an offset.
sys/kern/subr_uio.c: Add two new functions, physcopyin_vlist() and physcopyout_vlist().
sys/pc98/include/bus.h: Guard kernel-only parts of the pc98 machine/bus.h header with #ifdef _KERNEL.
This allows userland programs to include <machine/bus.h> to get the definition of bus_addr_t and bus_size_t.
sys/sys/bio.h: Add a new bio flag, BIO_VLIST.
sys/sys/uio.h: Add prototypes for physcopyin_vlist() and physcopyout_vlist().
share/man/man4/pass.4: Document the CAMIOQUEUE and CAMIOGET ioctls.
usr.sbin/Makefile: Add camdd.
usr.sbin/camdd/Makefile: Add a makefile for camdd(8).
usr.sbin/camdd/camdd.8: Man page for camdd(8).
usr.sbin/camdd/camdd.c: The new camdd(8) utility.
Sponsored by: Spectra Logic
------------------------------------------------------------------------ r291724 | ken | 2015-12-03 17:07:01 -0500 (Thu, 03 Dec 2015) | 6 lines
Fix typos in the camdd(8) usage() function output caused by an error in my diff filter script.
Sponsored by: Spectra Logic
------------------------------------------------------------------------ r291741 | ken | 2015-12-03 22:38:35 -0500 (Thu, 03 Dec 2015) | 10 lines
Fix g_disk_vlist_limit() to work properly with deletes.
Add a new bp argument to g_disk_maxsegs(), and add a new function, g_disk_maxsize() tha will properly determine the maximum I/O size for a delete or non-delete bio.
Submitted by: will Sponsored by: Spectra Logic
------------------------------------------------------------------------ ------------------------------------------------------------------------ r291742 | ken | 2015-12-03 22:44:12 -0500 (Thu, 03 Dec 2015) | 5 lines
Fix a style issue in g_disk_limit().
Noticed by: bdrewery
------------------------------------------------------------------------
Sponsored by: Spectra Logic
|
#
256281 |
|
10-Oct-2013 |
gjb |
Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation |
#
248654 |
|
23-Mar-2013 |
will |
Be more explicit about what each bio_cmd & bio_flags value means.
Reviewed by: ken (mentor)
|
#
248508 |
|
19-Mar-2013 |
kib |
Implement the concept of the unmapped VMIO buffers, i.e. buffers which do not map the b_pages pages into buffer_map KVA. The use of the unmapped buffers eliminate the need to perform TLB shootdown for mapping on the buffer creation and reuse, greatly reducing the amount of IPIs for shootdown on big-SMP machines and eliminating up to 25-30% of the system time on i/o intensive workloads.
The unmapped buffer should be explicitely requested by the GB_UNMAPPED flag by the consumer. For unmapped buffer, no KVA reservation is performed at all. The consumer might request unmapped buffer which does have a KVA reserve, to manually map it without recursing into buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already exists, the cache performs an upgrade, possibly reusing the KVA reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy(). Unmapped bio carry a pointer to the vm_page_t array, offset and length instead of the data pointer. The provider which processes the bio should explicitely specify a readiness to accept unmapped bio, otherwise g_down geom thread performs the transient upgrade of the bio request by mapping the pages into the new bio_transient_map KVA submap.
The bio_transient_map submap claims up to 10% of the buffer map, and the total buffer_map + bio_transient_map KVA usage stays the same. Still, it could be manually tuned by kern.bio_transient_maxcnt tunable, in the units of the transient mappings. Eventually, the bio_transient_map could be removed after all geom classes and drivers can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed tunable, disabling which makes the buffer (or cluster) creation requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only enabled by default on the architectures where pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is not the subject to maxbufspace limit anymore. Since the metadata buffers are always mapped, the buffers still have to fit into the buffer map, which provides a reasonable (but practically unreachable) upper bound on it. The non-metadata buffer allocations, both mapped and unmapped, is accounted against maxbufspace, as before. Effectively, this means that the maxbufspace is forced on mapped and unmapped buffers separately. The pre-patch bufspace limiting code did not worked, because buffer_map fragmentation does not allow the limit to be reached.
By Jeff Roberson request, the getnewbuf() function was split into smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation Discussed with: jeff (previous version) Tested by: pho, scottl (previous version), jhb, bf MFC after: 2 weeks
|
#
212160 |
|
02-Sep-2010 |
gibbs |
Correct bioq_disksort so that bioq_insert_tail() offers barrier semantic. Add the BIO_ORDERED flag for struct bio and update bio clients to use it.
The barrier semantics of bioq_insert_tail() were broken in two ways:
o In bioq_disksort(), an added bio could be inserted at the head of the queue, even when a barrier was present, if the sort key for the new entry was less than that of the last queued barrier bio.
o The last_offset used to generate the sort key for newly queued bios did not stay at the position of the barrier until either the barrier was de-queued, or a new barrier (which updates last_offset) was queued. When a barrier is in effect, we know that the disk will pass through the barrier position just before the "blocked bios" are released, so using the barrier's offset for last_offset is the optimal choice.
sys/geom/sched/subr_disk.c: sys/kern/subr_disk.c: o Update last_offset in bioq_insert_tail().
o Only update last_offset in bioq_remove() if the removed bio is at the head of the queue (typically due to a call via bioq_takefirst()) and no barrier is active.
o In bioq_disksort(), if we have a barrier (insert_point is non-NULL), set prev to the barrier and cur to it's next element. Now that last_offset is kept at the barrier position, this change isn't strictly necessary, but since we have to take a decision branch anyway, it does avoid one, no-op, loop iteration in the while loop that immediately follows.
o In bioq_disksort(), bypass the normal sort for bios with the BIO_ORDERED attribute and instead insert them into the queue with bioq_insert_tail(). bioq_insert_tail() not only gives the desired command order during insertion, but also provides barrier semantics so that commands disksorted in the future cannot pass the just enqueued transaction.
sys/sys/bio.h: Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.
sys/cam/ata/ata_da.c: sys/cam/scsi/scsi_da.c Use an ordered command for SCSI/ATA-NCQ commands issued in response to bios with the BIO_ORDERED flag set.
sys/cam/scsi/scsi_da.c Use an ordered tag when issuing a synchronize cache command.
Wrap some lines to 80 columns.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c sys/geom/geom_io.c Mark bios with the BIO_FLUSH command as BIO_ORDERED.
Sponsored by: Spectra Logic Corporation MFC after: 1 month
|
#
200403 |
|
11-Dec-2009 |
luigi |
only export bio_cmd and flags to userland (bio_cmd are used by ggatectl, flags are potentially useful). Other parts are internal kernel data structures and should not be visible to userland.
No API change involved.
MFC after: 3 days
|
#
193981 |
|
11-Jun-2009 |
luigi |
As discussed in the devsummit, introduce two fields in the struct bio to store classification information, and a hook for classifier functions that can be called by g_io_request().
This code is from Fabio Checconi as part of his GSOC work.
|
#
163832 |
|
31-Oct-2006 |
pjd |
Add a new I/O request - BIO_FLUSH, which basically tells providers below to flush their caches. For now will mostly be used by disks to flush their write cache.
Sponsored by: home.pl
|
#
156170 |
|
01-Mar-2006 |
pjd |
Assert proper use of bio_caller1, bio_caller2, bio_cflags, bio_driver1, bio_driver2 and bio_pflags fields.
Reviewed by: phk
|
#
152739 |
|
23-Nov-2005 |
kris |
Correct division by zero error in comment.
|
#
147324 |
|
12-Jun-2005 |
jeff |
- switch_point is now unused. This doesn't break module binary compatability since the structure is shrinking, not growing.
|
#
139825 |
|
07-Jan-2005 |
imp |
/* -> /*- for license, minor formatting changes
|
#
138800 |
|
13-Dec-2004 |
pjd |
Add bioq_insert_head() function.
OK'd by: phk
|
#
134519 |
|
30-Aug-2004 |
phk |
Add more KASSERTS and checks.
|
#
134038 |
|
19-Aug-2004 |
phk |
Add bioq_takefirst().
If the bioq is empty, NULL is returned. Otherwise the front element is removed and returned.
This can simplify locking in many drivers from:
lock() bp = bioq_first(bq); if (bp == NULL) { unlock() return } bioq_remove(bp, bq) unlock to: lock() bp = bioq_takefirst(bq); unlock() if (bp == NULL) return;
|
#
133142 |
|
04-Aug-2004 |
pjd |
- Add two fields to bio structure: 'bio_cflags' which can be used by consumer and 'bio_pflags' which can be used by provider. - Remove BIO_FLAG1 and BIO_FLAG2 flags. From now on new fields should be used for internal flags. - Update g_bio(9) manual page. - Update some comments. - Update GEOM_MIRROR, which was the only one using BIO_FLAGs.
Idea from: phk Reviewed by: phk
|
#
130585 |
|
16-Jun-2004 |
phk |
Do the dreaded s/dev_t/struct cdev */ Bump __FreeBSD_version accordingly.
|
#
127976 |
|
07-Apr-2004 |
imp |
Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999.
Approved by: core
|
#
125137 |
|
28-Jan-2004 |
phk |
Bring back the geom_bioqueues, they _are_ a good idea.
ATA will uses these RSN.
|
#
121216 |
|
18-Oct-2003 |
phk |
Retire bio_blkno entirely.
bio_offset is the field drivers should use. bio_pblkno remains as a convenient place to store the number of the device drivers.
|
#
121207 |
|
18-Oct-2003 |
phk |
Make bioq_disksort() sort on the bio_offset field instead of bio_pblkno.
|
#
113392 |
|
12-Apr-2003 |
phk |
Retire the experimental bio_taskqueue(), it was not quite as usable as hoped. It can be revived from here, should other drivers be able to use it.
|
#
113033 |
|
03-Apr-2003 |
phk |
Remove BIO_SETATTR from non-GEOM part of kernel as well.
|
#
112951 |
|
01-Apr-2003 |
phk |
Remove the #define for bioqdisksort(), it's no longer needed.
|
#
112941 |
|
01-Apr-2003 |
phk |
Introduce bioq_flush() function.
|
#
112848 |
|
30-Mar-2003 |
phk |
retire the "busy" field in bioqueues, it's served it's purpose.
|
#
112846 |
|
30-Mar-2003 |
phk |
Preparation commit before I start on the bioqueue lockdown:
Collect all the bits of bioqueue handing in subr_disk.c, vfs_bio.c is big enough as it is and disksort already lives in subr_disk.c.
|
#
110736 |
|
11-Feb-2003 |
phk |
Implement a bio-taskqueue to reduce number of context switches in disk I/O processing.
The intent is that the disk driver in its hardware interrupt routine will simply schedule the bio on the task queue with a routine to finish off whatever needs done.
The g_up thread will then schedule this routine, the likely outcome of which is a biodone() which queues the bio on g_up's regular queue where it will be picked up and processed.
Compared to the using the regular taskqueue, this saves one contextswitch.
Change our scheduling of the g_up and g_down queues to be water-tight, at the cost of breaking the userland regression test-shims.
Input and ideas from: scottl
|
#
110523 |
|
07-Feb-2003 |
phk |
Commit the correct copy of the g_stat structure.
Add debug.sizeof.g_stat sysctl.
Set the id field of the g_stat when we create consumers and providers.
Remove biocount from consumer, we will use the counters in the g_stat structure instead. Replace one field which will need to be atomically manipulated with two fields which will not (stat.nop and stat.nend).
Change add companion field to bio_children: bio_inbed for the exact same reason.
Don't output the biocount in the confdot output.
Fix KASSERT in g_io_request().
Add sysctl kern.geom.collectstats defaulting to off.
Collect the following raw statistics conditioned on this sysctl:
for each consumer and provider { total number of operations started. total number of operations completed. time last operation completed. sum of idle-time. for each of BIO_READ, BIO_WRITE and BIO_DELETE { number of operations completed. number of bytes completed. number of ENOMEM errors. number of other errors. sum of transaction time. } }
API for getting hold of these statistics data not included yet.
|
#
110517 |
|
07-Feb-2003 |
phk |
Rename bio_linkage to the more obvious bio_parent. Add bio_t0 timestamp, and include <sys/time.h> where needed
|
#
110230 |
|
02-Feb-2003 |
phk |
Add a bio_disk pointer for use between geom_disk and the device drivers.
|
#
110101 |
|
30-Jan-2003 |
phk |
NO_GEOM cleanup: Rip out iodone_chain with a big smile on my face.
|
#
104700 |
|
09-Oct-2002 |
phk |
Add a field for tallying the number of spawned bio's a bio has.
Rearrange and comment some GEOM related fields.
Sponsored by: DARPA & NAI Labs.
|
#
103683 |
|
20-Sep-2002 |
phk |
Make FreeBSD "struct disklabel" agnostic, step 312 of 723:
Rename bioqdisksort() to bioq_disksort(). Keep a #define around to avoid changing all diskdrivers right now.
Move it from subr_disklabel.c to subr_disk.c. Move prototype from <sys/disklabel.h> to <sys/bio.h>
Sponsored by: DARPA and NAI Labs.
|
#
103342 |
|
15-Sep-2002 |
phk |
Remove the unused _bio_buf field. I can't even remember if this ever got to be used in the first place.
Spotted by: bde
|
#
103330 |
|
14-Sep-2002 |
phk |
Un-inline the non-trivial "trivial" bio* functions. Untangle devstat_end_transaction_bio()
|
#
103281 |
|
13-Sep-2002 |
phk |
Oops, broke the build there. Uninline biodone() now that it is non-trivial.
Introduce biowait() function. Currently there is a race condition and the mitigation is a timeout/retry. It is not obvious what kind of locking (if any) is suitable for BIO_DONE, since the majority of users take are of this themselves, and only a few places actually rely on the wakeup.
Sponsored by: DARPA & NAI Labs.
|
#
103280 |
|
13-Sep-2002 |
phk |
Make biodone() default to wakeup() on the struct bio if no bio_done method was specified.
|
#
102956 |
|
05-Sep-2002 |
bde |
Forward declare struct uio so that <sys/uio.h> isn't a prerequisite.
Removed bogus forward declarations of structs.
|
#
96572 |
|
14-May-2002 |
phk |
Make daddr_t and u_daddr_t 64bits wide. Retire daddr64_t and use daddr_t instead.
Sponsored by: DARPA & NAI Labs.
|
#
94281 |
|
09-Apr-2002 |
phk |
Constifixion of bio_attribute.
|
#
93238 |
|
26-Mar-2002 |
phk |
Push BIO_FORMAT into a local hack inside the floppy drivers where it belongs.
|
#
93008 |
|
23-Mar-2002 |
bde |
Fixed some style bugs in the removal of __P(()). The main ones were not removing tabs before "__P((", and not outdenting continuation lines to preserve non-KNF lining up of code with parentheses. Switch to KNF formatting and/or rewrap the whole prototype in some cases.
|
#
92719 |
|
19-Mar-2002 |
alfred |
Remove __P
|
#
92363 |
|
15-Mar-2002 |
mckusick |
Introduce the new 64-bit size disk block, daddr64_t. Change the bio and buffer structures to have daddr64_t bio_pblkno, b_blkno, and b_lblkno fields which allows access to disks larger than a Terabyte in size. This change also requires that the VOP_BMAP vnode operation accept and return daddr64_t blocks. This delta should not affect system operation in any way. It merely sets up the necessary interfaces to allow the development of disk drivers that work with these larger disk block addresses. It also allows for the development of UFS2 which will use 64-bit block addresses.
|
#
92076 |
|
11-Mar-2002 |
phk |
Augment struct bio for GEOM.
|
#
91063 |
|
22-Feb-2002 |
phk |
GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED.
|
#
78968 |
|
29-Jun-2001 |
joerg |
Define BIO_CMD{1,2}, available for local hacks, similar to the already existing BIO_FLAG{1,2}. To be used in the fdc(4) driver soon.
|
#
76322 |
|
06-May-2001 |
phk |
Actually biofinish(struct bio *, struct devstat *, int error) is more general than the bioerror().
Most of this patch is generated by scripts.
|
#
76321 |
|
06-May-2001 |
phk |
Introduce bioerror(struct bio*, int err, int complete);
|
#
71041 |
|
14-Jan-2001 |
phk |
A bit of sanity-checking in bioqdisksort(): panic if we recurse.
|
#
60938 |
|
26-May-2000 |
jake |
Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen.
Requested by: msmith and others
|
#
60833 |
|
23-May-2000 |
jake |
Change the way that the queue(3) structures are declared; don't assume that the type argument to *_HEAD and *_ENTRY is a struct.
Suggested by: phk Reviewed by: phk Approved by: mdodd
|
#
60041 |
|
05-May-2000 |
phk |
Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>.
<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bdes teachings on the subject of nested includes.
Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.> unless they need caching of data.
Still a few bogus uses of struct buf to track down.
Repocopy by: peter
|
#
59915 |
|
03-May-2000 |
phk |
Convert the vm_pager_strategy() interface to take a struct bio instead of a struct buf. Don't try to examine B_ASYNC, it is a layering violation to do so. The only current user of this interface is vn(4) which, since it emulates a disk interface, operates on struct bio already.
|
#
59840 |
|
01-May-2000 |
phk |
Give struct bio it's own call back mechanism.
|
#
59762 |
|
29-Apr-2000 |
phk |
s/biowait/bufwait/g
Prodded by: several.
|
#
59623 |
|
25-Apr-2000 |
phk |
Clone the {b|bio}_offset field, and make sure it is always initialized in struct bio. Eventually, bio_offset will probably obsolete the bio_blkno and bio_pblkno fields.
Remove the special hack in atapi-cd.c to determine of bio_offset was valid.
|
#
59358 |
|
18-Apr-2000 |
phk |
Don't declare common variables in include files: move buftimelock til vfs_bio.c where it is initialized.
|
#
59249 |
|
15-Apr-2000 |
phk |
Complete the bio/buf divorce for all code below devfs::strategy
Exceptions: Vinum untouched. This means that it cannot be compiled. Greg Lehey is on the case.
CCD not converted yet, casts to struct buf (still safe)
atapi-cd casts to struct buf to examine B_PHYS
|
#
58942 |
|
02-Apr-2000 |
phk |
Clone bio versions of certain bits of infrastructure: devstat_end_transaction_bio() bioq_* versions of bufq_* incl bioqdisksort() the corresponding "buf" versions will disappear when no longer used.
Move b_offset, b_data and b_bcount to struct bio.
Add BIO_FORMAT as a hack for fd.c etc.
We are now largely ready to start converting drivers to use struct bio instead of struct buf.
|
#
58934 |
|
02-Apr-2000 |
phk |
Move B_ERROR flag to b_ioflags and call it BIO_ERROR.
(Much of this done by script)
Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.
Move b_pblkno and b_iodone_chain to struct bio while we transition, they will be obsoleted once bio structs chain/stack.
Add bio_queue field for struct bio aware disksort.
Address a lot of stylistic issues brought up by bde.
|
#
58926 |
|
02-Apr-2000 |
phk |
Draw the outline of "struct bio".
Struct bio is the future carrier of I/O requests for "struct buf".
|
#
58909 |
|
01-Apr-2000 |
dillon |
Change the write-behind code to take more care when starting async I/O's. The sequential read heuristic has been extended to cover writes as well. We continue to call cluster_write() normally, thus blocks in the file will still be reallocated for large (but still random) I/O's, but I/O will only be initiated for truely sequential writes.
This solves a number of annoying situations, especially with DBM (hash method) writes, and also has the side effect of fixing a number of (stupid) benchmarks.
Reviewed-by: mckusick
|
#
58349 |
|
20-Mar-2000 |
phk |
Rename the existing BUF_STRATEGY() to DEV_STRATEGY()
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)
substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)
This patch is machine generated except for the ccd.c and buf.h parts.
|
#
58345 |
|
20-Mar-2000 |
phk |
Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new field in struct buf: b_iocmd. The b_iocmd is enforced to have exactly one bit set.
B_WRITE was bogusly defined as zero giving rise to obvious coding mistakes.
Also eliminate the redundant struct buf flag B_CALL, it can just as efficiently be done by comparing b_iodone to NULL.
Should you get a panic or drop into the debugger, complaining about "b_iocmd", don't continue. It is likely to write on your disk where it should have been reading.
This change is a step in the direction towards a stackable BIO capability.
A lot of this patch were machine generated (Thanks to style(9) compliance!)
Vinum users: Greg has not had time to test this yet, be careful.
|
#
55697 |
|
09-Jan-2000 |
mckusick |
Several performance improvements for soft updates have been added: 1) Fastpath deletions. When a file is being deleted, check to see if it was so recently created that its inode has not yet been written to disk. If so, the delete can proceed to immediately free the inode. 2) Background writes: No file or block allocations can be done while the bitmap is being written to disk. To avoid these stalls, the bitmap is copied to another buffer which is written thus leaving the original available for futher allocations. 3) Link count tracking. Constantly track the difference in i_effnlink and i_nlink so that inodes that have had no change other than i_effnlink need not be written. 4) Identify buffers with rollback dependencies so that the buffer flushing daemon can choose to skip over them.
|
#
55205 |
|
29-Dec-1999 |
peter |
Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL" is an application space macro and the applications are supposed to be free to use it as they please (but cannot). This is consistant with the other BSD's who made this change quite some time ago. More commits to come.
|
#
54989 |
|
22-Dec-1999 |
mckusick |
Prettyness police: Identify flags in b_xflags with BX_ to distinguish them from flags in b_flags which are prefixed with B_
|
#
54480 |
|
12-Dec-1999 |
dillon |
Synopsis of problem being fixed: Dan Nelson originally reported that blocks of zeros could wind up in a file written to over NFS by a client. The problem only occurs a few times per several gigabytes of data. This problem turned out to be bug #3 below.
bug #1:
B_CLUSTEROK must be cleared when an NFS buffer is reverted from stage 2 (ready for commit rpc) to stage 1 (ready for write). Reversions can occur when a dirty NFS buffer is redirtied with new data.
Otherwise the VFS/BIO system may end up thinking that a stage 1 NFS buffer is clusterable. Stage 1 NFS buffers are not clusterable.
bug #2:
B_CLUSTEROK was inappropriately set for a 'short' NFS buffer (short buffers only occur near the EOF of the file). Change to only set when the buffer is a full biosize (usually 8K). This bug has no effect but should be fixed in -current anyway. It need not be backported.
bug #3:
B_NEEDCOMMIT was inappropriately set in nfs_flush() (which is typically only called by the update daemon). nfs_flush() does a multi-pass loop but due to the lack of vnode locking it is possible for new buffers to be added to the dirtyblkhd list while a flush operation is going on. This may result in nfs_flush() setting B_NEEDCOMMIT on a buffer which has *NOT* yet gone through its stage 1 write, causing only the commit rpc to be made and thus causing the contents of the buffer to be thrown away (never sent to the server).
The patch also contains some cleanup, which only applies to the commit into -current.
Reviewed by: dg, julian Originally Reported by: Dan Nelson <dnelson@emsphone.com>
|
#
54388 |
|
10-Dec-1999 |
phk |
Remove the B_BAD buffer flag, it is no longer used.
|
#
53304 |
|
17-Nov-1999 |
phk |
"b_unused1" was.
Fix comment for b_caller[12] fields.
Spotted by: grog
|
#
52452 |
|
24-Oct-1999 |
dillon |
Adjust the buffer cache to better handle small-memory machines. A slightly older version of this code was tested by BDE and I.
Also fixes a lockup situation when kva gets too fragmented.
Remove the maxvmiobufspace variable and sysctl, they are no longer used. Also cleanup (remove) #if 0 sections from prior commits.
This code is more of a hack, but presumably the whole buffer cache implementation is going to be rewritten in the next year so it's no big deal.
|
#
52066 |
|
09-Oct-1999 |
phk |
Give physio a makeover.
- Let physio take read/write compatible args and have it use uio->uio_rw to determine the direction.
- physread/physwrite are now #defines for physio
- Remove the inversly named minphys(), dev->si_iosize_max takes over.
- Physio() always uses pbufs.
- Fix the check for non page-aligned transfers, now only unaligned transfers larger than (MAXPHYS - PAGE_SIZE) get fragmented (only interesting for tapes using max blocksize).
- General wash-and-clean of code.
Constructive input from: bde
|
#
51702 |
|
26-Sep-1999 |
dillon |
Fix process p_locks accounting. Conversions of the owner to LK_KERNPROC caused p_locks to be improperly accounted.
Submitted by: Tor.Egge@fast.no
|
#
50477 |
|
27-Aug-1999 |
peter |
$Id$ -> $FreeBSD$
|
#
49771 |
|
14-Aug-1999 |
phk |
Spring cleaning around strategy and disklabels/slices:
Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout. please see comment in sys/conf.h about the flag argument.
Remove strategy argument from all the diskslice/label/bad144 implementations, it should be found from the dev_t.
Remove bogus and unused strategy1 routines.
Remove open/close arguments from dssize(). Pick them up from dev_t.
Remove unused and unfinished setgeom support from diskslice/label/bad144 code.
|
#
48710 |
|
09-Jul-1999 |
peter |
bufhashinit() is called with a caddr_t and is expected to return the same in both the alpha and i386 ports.
|
#
48677 |
|
08-Jul-1999 |
mckusick |
These changes appear to give us benefits with both small (32MB) and large (1G) memory machine configurations. I was able to run 'dbench 32' on a 32MB system without bring the machine to a grinding halt.
* buffer cache hash table now dynamically allocated. This will have no effect on memory consumption for smaller systems and will help scale the buffer cache for larger systems.
* minor enhancement to pmap_clearbit(). I noticed that all the calls to it used constant arguments. Making it an inline allows the constants to propogate to deeper inlines and should produce better code.
* removal of inherent vfs_ioopt support through the emplacement of appropriate #ifdef's, with John's permission. If we do not find a use for it by the end of the year we will remove it entirely.
* removal of getnewbufloops* counters & sysctl's - no longer necessary for debugging, getnewbuf() is now optimal.
* buffer hash table functions removed from sys/buf.h and localized to vfs_bio.c
* VFS_BIO_NEED_DIRTYFLUSH flag and support code added ( bwillwrite() ), allowing processes to block when too many dirty buffers are present in the system.
* removal of a softdep test in bdwrite() that is no longer necessary now that bdwrite() no longer attempts to flush dirty buffers.
* slight optimization added to bqrelse() - there is no reason to test for available buffer space on B_DELWRI buffers.
* addition of reverse-scanning code to vfs_bio_awrite(). vfs_bio_awrite() will attempt to locate clusterable areas in both the forward and reverse direction relative to the offset of the buffer passed to it. This will probably not make much of a difference now, but I believe we will start to rely on it heavily in the future if we decide to shift some of the burden of the clustering closer to the actual I/O initiation.
* Removal of the newbufcnt and lastnewbuf counters that Kirk added. They do not fix any race conditions that haven't already been fixed by the gbincore() test done after the only call to getnewbuf(). getnewbuf() is a static, so there is no chance of it being misused by other modules. ( Unless Kirk can think of a specific thing that this code fixes. I went through it very carefully and didn't see anything ).
* removal of VOP_ISLOCKED() check in flushbufqueues(). I do not think this check is necessary, the buffer should flush properly whether the vnode is locked or not. ( yes? ).
* removal of extra arguments passed to getnewbuf() that are not necessary.
* missed cluster_wbuild() that had to be a cluster_wbuild_wb() in vfs_cluster.c
* vn_write() now calls bwillwrite() *PRIOR* to locking the vnode, which should greatly aid flushing operations in heavy load situations - both the pageout and update daemons will be able to operate more efficiently.
* removal of b_usecount. We may add it back in later but for now it is useless. Prior implementations of the buffer cache never had enough buffers for it to be useful, and current implementations which make more buffers available might not benefit relative to the amount of sophistication required to implement a b_usecount. Straight LRU should work just as well, especially when most things are VMIO backed. I expect that (even though John will not like this assumption) directories will become VMIO backed some point soon.
Submitted by: Matthew Dillon <dillon@backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
|
#
48544 |
|
03-Jul-1999 |
mckusick |
The buffer queue mechanism has been reformulated. Instead of having QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN, QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean and dirty buffers have been separated. Empty buffers with KVM assignments have been separated from truely empty buffers. getnewbuf() has been rewritten and now operates in a 100% optimal fashion. That is, it is able to find precisely the right kind of buffer it needs to allocate a new buffer, defragment KVM, or to free-up an existing buffer when the buffer cache is full (which is a steady-state situation for the buffer cache).
Buffer flushing has been reorganized. Previously buffers were flushed in the context of whatever process hit the conditions forcing buffer flushing to occur. This resulted in processes blocking on conditions unrelated to what they were doing. This also resulted in inappropriate VFS stacking chains due to multiple processes getting stuck trying to flush dirty buffers or due to a single process getting into a situation where it might attempt to flush buffers recursively - a situation that was only partially fixed in prior commits. We have added a new daemon called the buf_daemon which is responsible for flushing dirty buffers when the number of dirty buffers exceeds the vfs.hidirtybuffers limit. This daemon attempts to dynamically adjust the rate at which dirty buffers are flushed such that getnewbuf() calls (almost) never block.
The number of nbufs and amount of buffer space is now scaled past the 8MB limit that was previously imposed for systems with over 64MB of memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed somewhat. The number of physical buffers has been increased with the intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which could result in non-deterministic operation under certain conditions, such as when a large number of dirty buffers are being managed. This algorithm has been changed. reassignbuf now keeps buffers locally sorted if it can do so cheaply, and otherwise gives up and adds buffers to the head of the dirtyblkhd list. The new algorithm is deterministic but not perfect. The new algorithm greatly reduces problems that previously occured when write_behind was turned off in the system.
The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive P_BUFEXHAUST bit. This bit allows processes working with filesystem buffers to use available emergency reserves. Normal processes do not set this bit and are not allowed to dig into emergency reserves. The purpose of this bit is to avoid low-memory deadlocks.
A small race condition was fixed in getpbuf() in vm/vm_pager.c.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
|
#
48333 |
|
29-Jun-1999 |
peter |
Hopefully fix the remaining glitches with the BUF_*() changes. This should (really this time) fix pageout to swap and a couple of clustering cases.
This simplifies BUF_KERNPROC() so that it unconditionally reassigns the lock owner rather than testing B_ASYNC and having the caller decide when to do the reassign. At present this is required because some places use B_CALL/b_iodone to free the buffers without B_ASYNC being set. Also, vfs_cluster.c explicitly calls BUF_KERNPROC() when attaching the buffers rather than the parent walking the cluster_head tailq.
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
|
#
48273 |
|
27-Jun-1999 |
peter |
The BUF_*() routines must be internally splbio() protected otherwise they can cause a biodone() from a disk interrupt to spin when the interrupt code tries to grab the simplelock. Masking BIO here means buftimelock and/or lk->lk_interlock shouldn't be held when an interrupt tries to grab them.
|
#
48267 |
|
27-Jun-1999 |
peter |
Make SMP work again. lockmgr() needed to be told to free the buftimelock interlock.
|
#
48247 |
|
26-Jun-1999 |
peter |
Quick fix to make libcam compile.. I don't know about the rest of world yet.
|
#
48225 |
|
26-Jun-1999 |
mckusick |
Convert buffer locking from using the B_BUSY and B_WANTED flags to using lockmgr locks. This commit should be functionally equivalent to the old semantics. That is, all buffer locking is done with LK_EXCLUSIVE requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will be done in future commits.
|
#
46625 |
|
07-May-1999 |
phk |
Introduce two functions: physread() and physwrite() and use these directly in *devsw[] rather than the 46 local copies of the same functions.
(grog will do the same for vinum when he has time)
|
#
46580 |
|
06-May-1999 |
phk |
remove b_proc from struct buf, it's (now) unused.
Reviewed by: dillon, bde
|
#
46566 |
|
06-May-1999 |
phk |
Remove unused fields from struct buf: b_savekva b_validoff b_validend
Reviewed by: dillon, bde
|
#
46349 |
|
02-May-1999 |
alc |
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly.
There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
|
#
44679 |
|
12-Mar-1999 |
julian |
Reviewed by: Many at differnt times in differnt parts, including alan, john, me, luoqi, and kirk Submitted by: Matt Dillon <dillon@frebsd.org>
This change implements a relatively sophisticated fix to getnewbuf(). There were two problems with getnewbuf(). First, the writerecursion can lead to a system stack overflow when you have NFS and/or VN devices in the system. Second, the free/dirty buffer accounting was completely broken. Not only did the nfs routines blow it trying to manually account for the buffer state, but the accounting that was done did not work well with the purpose of their existance: figuring out when getnewbuf() needs to sleep.
The meat of the change is to kern/vfs_bio.c. The remaining diffs are all minor except for NFS, which includes both the fixes for bp interaction AND fixes for a 'biodone(): buffer already done' lockup. Sys/buf.h also contains a chaining structure which is not used by this patchset but is used by other patches that are coming soon. This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF. (sorry for the missing T matt)
|
#
44391 |
|
02-Mar-1999 |
mckusick |
When fsync'ing a file on a filesystem using soft updates, we first try to write all the dirty blocks. If some of those blocks have dependencies, they will be remarked dirty when the I/O completes. On systems with really fast I/O systems, it is possible to get in an infinite loop trying to flush the buffers, because the I/O finishes before we can get all the dirty buffers off the v_dirtyblkhd list and into the I/O queue. (The previous algorithm looped over the v_dirtyblkhd list writing out buffers until the list emptied.) So, now we mark each buffer that we try to write so that we can distinguish the ones that are being remarked dirty from those that we have not yet tried to flush. Once we have tried to push every buffer once, we then push any associated metadata that is causing the remaining buffers to be redirtied.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
|
#
42980 |
|
21-Jan-1999 |
peter |
Typo: s/bpreassignbuf/pbreassignbuf/ so the prototype matches it's function
|
#
42957 |
|
21-Jan-1999 |
dillon |
This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
|
#
41124 |
|
12-Nov-1998 |
dg |
Restored the "reallocblks" code to its former glory. What this does is basically do a on-the-fly defragmentation of the FFS filesystem, changing file block allocations to make them contiguous. Thanks to Kirk McKusick for providing hints on what needed to be done to get this working.
|
#
40786 |
|
31-Oct-1998 |
peter |
Convert the vnode clean/dirty attached buffer lists from LISTs to TAILQs. Add a new flags field (we get this for free because of struct packing) for cleaner management of tailq membership. We had two spare b_flags slots, but they are a precious resource and may be needed for other things that are related to other b_flags bits. The two new flags are convenient to use in a seperate location.
Reviewed (in principle) by: dg Obtained from: John Dyson's old work-in-progress
|
#
39924 |
|
03-Oct-1998 |
bde |
Quick fix for not being able to sync all the buffers in boot() if an ext2fs file system is mounted. The soft update changes added a check for B_DELWRI buffers. This exposed the complete brokenness of the previous quick fix for failing syncs (PR 3571, committed on 1997/08/04). Use a new buffer flag B_DIRTY and don't abuse B_DELWRI. B_DIRTY buffers are still written too late, as broken in the previous fix. This is fairly harmless, because B_DIRTY is only used for bitmap buffers and fsck.ext2 can fix up the bitmaps perfectly.
Fixed a race in ULCK_BUF() (bremfree() was outside of the splbio() section).
|
#
39648 |
|
25-Sep-1998 |
peter |
Goodbye BOUNCE_BUFFERS, for a hack it has served us well.
The last consumer of this code (the old SCSI system) has left us and the CAM code does it's own bouncing. The isa dma system has been doing it's own bouncing for a while too.
Reviewed by: core
|
#
39623 |
|
24-Sep-1998 |
luoqi |
Eliminate a race in VOP_FSYNC() when softupdates is enabled. Submitted by: Kirk McKusick <mckusick@McKusick.COM> Two minor changes are also included, 1. Remove gratuitious checks for error return from vn_lock with LK_RETRY set, vn_lock should always succeed in these cases. 2. Back out change rev. 1.36->1.37, which unnecessarily makes async mount a little more unstable. It also keeps us in sync with other BSDs. Suggested by: Bruce Evans <bde@zeta.org.au>
|
#
39238 |
|
15-Sep-1998 |
gibbs |
When a buffer is removed from a buffer queue, remember it's block number and use it as "the currently active" buffer in doing disk sort calculations.
|
#
38862 |
|
05-Sep-1998 |
phk |
Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform device drivers about sectors no longer in use.
Device-drivers receive the call through d_strategy, if they have D_CANFREE in d_flags.
This allows flash based devices to erase the sectors and avoid pointlessly carrying them around in compactions.
Reviewed by: Kirk Mckusick, bde Sponsored by: M-Systems (www.m-sys.com)
|
#
38524 |
|
24-Aug-1998 |
phk |
Remove the last remaining evidence of B_TAP. Reclaim 3 unused bits in b_flags
|
#
37653 |
|
15-Jul-1998 |
bde |
Cast pointers to [u]intptr_t instead of to [unsigned] long.
|
#
36022 |
|
13-May-1998 |
gibbs |
Fix bogus "cleanup" in bufq_remove. The "switch point" for tqdisksort was getting mangled.
Submitted by: Bruce Evans <bde@zeta.org.au>
|
#
35766 |
|
05-May-1998 |
gibbs |
Now that we have a TAILQ_PREV() that returns the previous object, simplify some of the buf_queue inline functions.
|
#
34924 |
|
28-Mar-1998 |
bde |
Moved some #includes from <sys/param.h> nearer to where they are actually used.
|
#
34695 |
|
19-Mar-1998 |
dyson |
Remove b_generation.
|
#
34611 |
|
15-Mar-1998 |
dyson |
Some VM improvements, including elimination of alot of Sig-11 problems. Tor Egge and others have helped with various VM bugs lately, but don't blame him -- blame me!!!
pmap.c: 1) Create an object for kernel page table allocations. This fixes a bogus allocation method previously used for such, by grabbing pages from the kernel object, using bogus pindexes. (This was a code cleanup, and perhaps a minor system stability issue.)
pmap.c: 2) Pre-set the modify and accessed bits when prudent. This will decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c: 3) Rather than calculating the beginning virtual byte offset multiple times, stick the offset into the buffer header, so that the calculated offset can be reused. (Long long multiplies are often expensive, and this is a probably unmeasurable performance improvement, and code cleanup.)
vfs_bio.c: 4) Handle write recursion more intelligently (but not perfectly) so that it is less likely to cause a system panic, and is also much more robust.
vfs_bio.c: 5) getblk incorrectly wrote out blocks that are incorrectly sized. The problem is fixed, and writes blocks out ONLY when B_DELWRI is true.
vfs_bio.c: 6) Check that already constituted buffers have fully valid pages. If not, then make sure that the B_CACHE bit is not set. (This was a major source of Sig-11 type problems.)
vfs_bio.c: 7) Fix a potential system deadlock due to an incorrectly specified sleep priority while waiting for a buffer write operation. The change that I made opens the system up to serious problems, and we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c: 8) Make clustered reads work more correctly (and more completely) when buffers are already constituted, but not fully valid. (This was another system reliability issue.)
vfs_subr.c, ffs_inode.c: 9) Create a vtruncbuf function, which is used by filesystems that can truncate files. The vinvalbuf forced a file sync type operation, while vtruncbuf only invalidates the buffers past the new end of file, and also invalidates the appropriate pages. (This was a system reliabiliy and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c: 11) Make the object rundown mechanism for OBJT_VNODE type objects work more correctly. Included in that fix, create pager entries for the OBJT_DEAD pager type, so that paging requests that might slip in during race conditions are properly handled. (This was a system reliability issue.)
vm_page.c: 12) Make some of the page validation routines be a little less picky about arguments passed to them. Also, support page invalidation change the object generation count so that we handle generation counts a little more robustly.
vm_pageout.c: 13) Further reduce pageout daemon activity when the system doesn't need help from it. There should be no additional performance decrease even when the pageout daemon is running. (This was a significant performance issue.)
vnode_pager.c: 14) Teach the vnode pager to handle race conditions during vnode deallocations.
|
#
34266 |
|
08-Mar-1998 |
julian |
Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman) Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: WHistle development tree
|
#
34206 |
|
07-Mar-1998 |
dyson |
This mega-commit is meant to fix numerous interrelated problems. There has been some bitrot and incorrect assumptions in the vfs_bio code. These problems have manifest themselves worse on NFS type filesystems, but can still affect local filesystems under certain circumstances. Most of the problems have involved mmap consistancy, and as a side-effect broke the vfs.ioopt code. This code might have been committed seperately, but almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that are fully valid. 2) Rather than deactivating erroneously read initial (header) pages in kern_exec, we now free them. 3) Fix the rundown of non-VMIO buffers that are in an inconsistent (missing vp) state. 4) Fix the disassociation of pages from buffers in brelse. The previous code had rotted and was faulty in a couple of important circumstances. 5) Remove a gratuitious buffer wakeup in vfs_vmio_release. 6) Remove a crufty and currently unused cluster mechanism for VBLK files in vfs_bio_awrite. When the code is functional, I'll add back a cleaner version. 7) The page busy count wakeups assocated with the buffer cache usage were incorrectly cleaned up in a previous commit by me. Revert to the original, correct version, but with a cleaner implementation. 8) The cluster read code now tries to keep data associated with buffers more aggressively (without breaking the heuristics) when it is presumed that the read data (buffers) will be soon needed. 9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The delay loop waiting is not useful for filesystem locks, due to the length of the time intervals. 10) Correct and clean-up spec_getpages. 11) Implement a fully functional nfs_getpages, nfs_putpages. 12) Fix nfs_write so that modifications are coherent with the NFS data on the server disk (at least as well as NFS seems to allow.) 13) Properly support MS_INVALIDATE on NFS. 14) Properly pass down MS_INVALIDATE to lower levels of the VM code from vm_map_clean. 15) Better support the notion of pages being busy but valid, so that fewer in-transit waits occur. (use p->busy more for pageouts instead of PG_BUSY.) Since the page is fully valid, it is still usable for reads. 16) It is possible (in error) for cached pages to be busy. Make the page allocation code handle that case correctly. (It should probably be a printf or panic, but I want the system to handle coding errors robustly. I'll probably add a printf.) 17) Correct the design and usage of vm_page_sleep. It didn't handle consistancy problems very well, so make the design a little less lofty. After vm_page_sleep, if it ever blocked, it is still important to relookup the page (if the object generation count changed), and verify it's status (always.) 18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up. 19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush. 20) Fix vm_pager_put_pages and it's descendents to support an int flag instead of a boolean, so that we can pass down the invalidate bit.
|
#
32702 |
|
22-Jan-1998 |
dyson |
VM level code cleanups.
1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneded collpase operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
|
#
31493 |
|
02-Dec-1997 |
phk |
In all such uses of struct buf: 's/b_un.b_addr/b_data/g'
|
#
30670 |
|
23-Oct-1997 |
gibbs |
Fix a potentially disasterous '==' instead of '=' bug.
Submitted by: John-Mark Gurney <gurney_j@resnet.uoregon.edu>
|
#
29683 |
|
21-Sep-1997 |
gibbs |
buf.h: Change the definition of a buffer queue so that bufqdisksort can properly deal with bordered writes.
Add inline functions for accessing buffer queues. This should be considered an opaque data structure by clients.
callout.h: New callout implementation.
device.h: Add support for CAM interrupts.
disk.h: disklabel.h: tqdisksort->bufqdisksort
kernel.h: Add new configuration entries for configuration hooks and calling cpu_rootconf and cpu_dumpconf.
param.h: Add a priority for sleeping waiting on config hooks.
proc.h: Update for new callout implementation.
queue.h: Add TAILQ_HEAD_INITIALIZER from NetBSD.
systm.h: Add prototypes for cpu_root/dumpconf, splcam, splsoftcam, etc..
|
#
29506 |
|
16-Sep-1997 |
bde |
Fixed gratuitous ANSIisms.
|
#
29211 |
|
07-Sep-1997 |
bde |
Some staticized variables were still declared to be extern.
|
#
26664 |
|
15-Jun-1997 |
dyson |
Fix a problem with the VN device. Specifically, the VN device can cause a problem of spiraling death due to buffer resource limitations. The vfs_bio code in general had little ability to handle buffer resource management, and now it does. Also, there are a lot more knobs for tuning the vfs_bio code now. The knobs came free because of the need that there always be some immediately available buffers (non-delayed or locked) for use. Note that the buffer cache code is much less likely to get bogged down with lots of delayed writes, even more so than before.
|
#
22975 |
|
22-Feb-1997 |
peter |
Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
|
#
21673 |
|
14-Jan-1997 |
jkh |
Make the long-awaited change from $Id$ to $FreeBSD$
This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
|
#
21002 |
|
29-Dec-1996 |
dyson |
This commit is the embodiment of some VFS read clustering improvements. Firstly, now our read-ahead clustering is on a file descriptor basis and not on a per-vnode basis. This will allow multiple processes reading the same file to take advantage of read-ahead clustering. Secondly, there previously was a problem with large reads still using the ramp-up algorithm. Of course, that was bogus, and now we read the entire "chunk" off of the disk in one operation. The read-ahead clustering algorithm should use less CPU than the previous also (I hope :-)).
NOTE: THAT LKMS MUST BE REBUILT!!!
|
#
20054 |
|
30-Nov-1996 |
dyson |
Implement a new totally dynamic (up to MAXPHYS) buffer kva allocation scheme. Additionally, add the capability for checking for unexpected kernel page faults. The maximum amount of kva space for buffers hasn't been decreased from where it is, but it will now be possible to do so.
This scheme manages the kva space similar to the buffers themselves. If there isn't enough kva space because of usage or fragementation, buffers will be reclaimed until a buffer allocation is successful. This scheme should be very resistant to fragmentation problems until/if the LFS code is fixed and uses the bogus buffer locking scheme -- but a 'fixed' LFS is not likely to use such a scheme.
Now there should be NO problem allocating buffers up to MAXPHYS.
|
#
18910 |
|
13-Oct-1996 |
phk |
Remove some old compatibility names.
|
#
18068 |
|
06-Sep-1996 |
gibbs |
Add B_ORDERED buffer flag and prototype for the bowrite function.
Bowrite guarantees that buffers queued after a call to bowrite will be written after the specified buffer (to a particular device). Bowrite does this either by taking advantage of hardware ordering support (e.g. tagged queueing on SCSI devices) or by resorting to a synchronous write.
|
#
15582 |
|
03-May-1996 |
phk |
Remove buf->b_actf, nobody uses it anymore. Clean up some pmap macro usage.
|
#
15493 |
|
01-May-1996 |
bde |
Removed bogus _BEGIN_DECLS/_END_DECLS.
Removed unused struct tag declarations in cloned code.
Added or cleaned up idempotency ifdefs.
|
#
15125 |
|
07-Apr-1996 |
phk |
remove b_actb, it's not used anywhere.
|
#
14476 |
|
11-Mar-1996 |
hsu |
Merge in Lite2: rename B_APPENDWRITE to B_NEEDCOMMIT. The other change to b_pfcent is no longer pertinent, since we've deleted that field. Reviewed by: davidg & bde
|
#
13765 |
|
30-Jan-1996 |
mpp |
Fix a bunch of spelling errors in the comment fields of a bunch of system include files.
|
#
13490 |
|
19-Jan-1996 |
dyson |
Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do alot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
|
#
13086 |
|
28-Dec-1995 |
dg |
Made bzero a function vector and added a 586/686 optimized version of bzero. Deprecated blkclr (removed it). Removed some old cruft from cpufunc.h.
The optimized bzero was submitted by Torbjorn Granlund <tege@matematik.su.se> The kernel adaption and other changes by me.
|
#
12767 |
|
11-Dec-1995 |
dyson |
Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
|
#
12428 |
|
20-Nov-1995 |
phk |
Fully prototype physio().
|
#
12408 |
|
19-Nov-1995 |
dyson |
First set of changes to eliminate the ad-hoc device buffer queues, replacing them with TAILQ's as appropriate. The SCSI code is the first to be changed -- until the changes are complete, both b_act and b_actf will be in the buf structure. b_actf will eventually be removed.
|
#
12404 |
|
19-Nov-1995 |
dyson |
General fixes to the vfs clustring code:
1) Make cluster buffer list be a non-malloced chain. This eliminates yet another 'evil' M_WAITOK and generally cleans up the code. 2) Fix write clustering for ext2fs. It was just broken. Also, ffs clustering had an efficiency problem that more bawrites were happening than should have been. 3) Make changes to buf.h to support the above, plus remove b_pfcent at the request of David Greenman. Reviewed by: davidg (partially)
|
#
10226 |
|
24-Aug-1995 |
dg |
Improved BUFHASH algorithm.
|
#
9759 |
|
29-Jul-1995 |
bde |
Eliminate sloppy common-style declarations. There should be none left for the LINT configuation.
|
#
8876 |
|
30-May-1995 |
rgrimes |
Remove trailing whitespace.
|
#
7935 |
|
19-Apr-1995 |
dg |
New flag: B_PAGING. Added as part of the vn driver hack.
|
#
7695 |
|
09-Apr-1995 |
dg |
Changes from John Dyson and myself:
Fixed remaining known bugs in the buffer IO and VM system.
vfs_bio.c: Fixed some race conditions and locking bugs. Improved performance by removing some (now) unnecessary code and fixing some broken logic. Fixed process accounting of # of FS outputs. Properly handle NFS interrupts (B_EINTR).
(various) Replaced calls to clrbuf() with calls to an optimized routine called vfs_bio_clrbuf().
(various FS sync) Sync out modified vnode_pager backed pages.
ffs_vnops.c: Do two passes: Sync out file data first, then indirect blocks.
vm_fault.c: Fixed deadly embrace caused by acquiring locks in the wrong order.
vnode_pager.c: Changed to use buffer I/O system for writing out modified pages. This should fix the problem with the modification date previous not getting updated. Also dramatically simplifies the code. Note that this is going to change in the future and be implemented via VOP_PUTPAGES().
vm_object.c: Fixed a pile of bugs related to cleaning (vnode) objects. The performance of vm_object_page_clean() is terrible when dealing with huge objects, but this will change when we implement a binary tree to keep the object pages sorted.
vm_pageout.c: Fixed broken clustering of pageouts. Fixed race conditions and other lockup style bugs in the scanning of pages. Improved performance.
|
#
7430 |
|
28-Mar-1995 |
bde |
Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) that I didn't notice when I fixed "all" such warnings before.
|
#
7399 |
|
26-Mar-1995 |
dg |
Removed third arg (vmio) to allocbuf() that was added with the original merged cache changes, and figure it out based on the B_VMIO buffer flag. Fixes a problem where delayed write VMIO buffers would sometimes get recopied into kernel-alloced memory.
Submitted by: John Dyson
|
#
7353 |
|
25-Mar-1995 |
dg |
Removed an old VMIO #ifdef and made the type of b_pages 'struct vm_page *'.
|
#
7090 |
|
16-Mar-1995 |
bde |
Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
|
#
6549 |
|
18-Feb-1995 |
bde |
New field b_biodone_chain to support nested b_iodone's.
|
#
5455 |
|
09-Jan-1995 |
dg |
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme.
vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering.
vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption.
vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs.
vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore.
machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme.
machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers.
Submitted by: John Dyson and David Greenman
|
#
3688 |
|
18-Oct-1994 |
dg |
Removed references to bclnlist which we don't use/support/need.
|
#
3484 |
|
09-Oct-1994 |
phk |
Cosmetics. (sort of) Added 19 prototypes.
|
#
3374 |
|
05-Oct-1994 |
dg |
Stuff object into v_vmdata rather than pager. Not important which at the moment, but will be in the future. Other changes mostly cosmetic, but are made for future VMIO considerations.
Submitted by: John Dyson
|
#
3098 |
|
25-Sep-1994 |
phk |
While in the real world, I had a bad case of being swapped out for a lot of cycles. While waiting there I added a lot of the extra ()'s I have, (I have never used LISP to any extent). So I compiled the kernel with -Wall and shut up a lot of "suggest you add ()'s", removed a bunch of unused var's and added a couple of declarations here and there. Having a lap-top is highly recommended. My kernel still runs, yell at me if you kernel breaks.
|
#
2112 |
|
18-Aug-1994 |
wollman |
Fix up some sloppy coding practices:
- Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above.
NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
|
#
1887 |
|
06-Aug-1994 |
dg |
Incorporated post 1.1.5 work from John Dyson. This includes performance improvements via the new routines pmap_qenter/pmap_qremove and pmap_kenter/ pmap_kremove. These routine allow fast mapping of pages for those architectures that have "normal" MMUs. Also included is a fix to the pageout daemon to properly check a queue end condition.
Submitted by: John Dyson
|
#
1817 |
|
02-Aug-1994 |
dg |
Added $Id$
|
#
1564 |
|
26-May-1994 |
dg |
Moved header definitions to buf.h, and added missing splx() - found by Johannes Helander.
|
#
1549 |
|
25-May-1994 |
rgrimes |
The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.
Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
|
#
1541 |
|
24-May-1994 |
rgrimes |
BSD 4.4 Lite Kernel Sources
|