$FreeBSD$

For the lack of a better place to put them, this file will contain
notes on some of the more intricate details of geom.

-----------------------------------------------------------------------
Locking of bio_children and bio_inbed

bio_children is used by g_std_done() and g_clone_bio() to keep track
of children cloned off a request.  g_clone_bio() increments the
bio_children counter each time it is called.  g_std_done() increments
bio_inbed on every call and, if the two counters are then equal,
calls g_io_deliver() on the parent bio.

The general assumption is that g_clone_bio() is called only in
the g_down thread, and g_std_done() only in the g_up thread, and
therefore the two fields do not generally need locking.  These
restrictions are not enforced by the code, but only with great
care should they be violated.

It is the responsibility of the class implementation to avoid the
following race condition:  A class intends to split a bio into two
children.  It clones the bio and requests I/O on the first child.
This I/O operation completes before the second child is cloned,
so g_std_done() sees both counters equal to 1 and finishes off
the parent bio prematurely.

There is no race in the common case where the bio is split
into multiple parts in the class start method and the I/O is requested
on another GEOM class below:  There is only one g_down thread, and
the class below will not get its start method run until we return
from our start method; consequently the I/O cannot complete
prematurely.

In all other cases, this race needs to be mitigated, for instance
by cloning all children before I/O is requested on any of them.

Notice that cloning an "extra" child and calling g_std_done() on
it directly opens another race, since the assumption is that
g_std_done() is called only in the g_up thread.
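
The premature-completion race and the clone-first mitigation described
above can be illustrated with a small userland model.  This is only a
sketch: struct model_bio, model_clone(), model_done() and the two
split_*() helpers are hypothetical names invented for the illustration,
and they mimic only the counter behaviour attributed above to
g_clone_bio() and g_std_done(), not the real kernel code.

```c
#include <assert.h>

/* Hypothetical stand-in for the two counters kept in the parent bio. */
struct model_bio {
	unsigned children;	/* models bio_children */
	unsigned inbed;		/* models bio_inbed */
	int delivered;		/* times the parent would have been delivered */
};

/* Models g_clone_bio(): one more child has been cloned off the parent. */
static void
model_clone(struct model_bio *parent)
{
	parent->children++;
}

/* Models g_std_done(): a child completed; deliver if all are in. */
static void
model_done(struct model_bio *parent)
{
	parent->inbed++;
	if (parent->children == parent->inbed)
		parent->delivered++;
}

/*
 * Racy order: the first child completes before the second is cloned.
 * Returns the delivery count seen after the first completion.
 */
static int
split_interleaved(struct model_bio *parent)
{
	int early;

	model_clone(parent);
	model_done(parent);		/* counters are 1 and 1: delivered */
	early = parent->delivered;
	model_clone(parent);
	model_done(parent);
	return (early);
}

/* Mitigated order: clone every child before any I/O can complete. */
static int
split_clone_first(struct model_bio *parent)
{
	int early;

	model_clone(parent);
	model_clone(parent);
	model_done(parent);		/* counters are 2 and 1: not delivered */
	early = parent->delivered;
	model_done(parent);
	return (early);
}
```

In this model the interleaved order delivers the parent after the first
child completes (and then a second time), while the clone-first order
delivers it exactly once, after the last child.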

-----------------------------------------------------------------------
Statistics collection

Statistics collection can run at three levels controlled by the
"kern.geom.collectstats" sysctl.

At level zero, only the numbers of transactions started and completed
are counted, and this is only because GEOM internally uses the difference
between these two as a sanity check.

At level one we collect the full statistics.  Higher levels are
reserved for future use.  Statistics are collected independently
on both the provider and the consumer, because multiple consumers
can be active against the same provider at the same time.

The statistics collection falls into two parts:

The first and simpler part consists of g_io_request() timestamping
the struct bio when the request is first started and g_io_deliver()
updating the consumer's and provider's statistics based on fields in
the bio when it is completed.  There are no concurrency or locking
concerns in this part.  The statistics collected consist of the number
of requests, number of bytes, number of ENOMEM errors, number of
other errors and duration of the request for each of the three
major request types: BIO_READ, BIO_WRITE and BIO_DELETE.

The second part is trying to keep track of the "busy%".

If in g_io_request() we find that there are no outstanding requests
(based on the counters for scheduled and completed requests being
equal), we set a timestamp in the "wentbusy" field.  Since there
are no outstanding requests, and as long as there is only one thread
pushing the g_down queue, we cannot possibly conflict with
g_io_deliver() until we ship the current request down.

In g_io_deliver() we calculate the delta-T from wentbusy and add this
to the "bt" field, and set wentbusy to the current timestamp.  We
take care to do this before we increment the "requests completed"
counter, since that prevents g_io_request() from touching the
"wentbusy" timestamp concurrently.

The statistics data is made available to userland through the use
of a special allocator (in geom_stats.c) which, through a device,
allows userland to mmap(2) the pages containing the statistics data.
In order to indicate to userland when the data in a statistics
structure might be inconsistent, g_io_deliver() atomically sets a
flag "updating" and resets it when the structure is again consistent.
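
The "busy%" bookkeeping described above can be sketched as a small
single-threaded model.  The names struct model_stat, nstart, nend,
model_request() and model_deliver() are assumptions made up for this
illustration; only "wentbusy" and "bt" are taken from the text, and
timestamps are plain integers in arbitrary units rather than whatever
the kernel actually uses.

```c
#include <assert.h>

/* Hypothetical model of the fields involved in busy-time accounting. */
struct model_stat {
	unsigned long nstart;	/* requests scheduled */
	unsigned long nend;	/* requests completed */
	unsigned long wentbusy;	/* when the current busy interval began */
	unsigned long bt;	/* accumulated busy time */
};

/* Models the g_io_request() side: start the busy clock if idle. */
static void
model_request(struct model_stat *s, unsigned long now)
{
	if (s->nstart == s->nend)	/* no outstanding requests */
		s->wentbusy = now;
	s->nstart++;
}

/* Models the g_io_deliver() side: account busy time, then complete. */
static void
model_deliver(struct model_stat *s, unsigned long now)
{
	assert(s->nend < s->nstart);	/* something must be outstanding */
	s->bt += now - s->wentbusy;
	s->wentbusy = now;
	/* Incremented last, so the request side cannot see "idle" early. */
	s->nend++;
}
```

For example, requests started at times 10 and 12 and delivered at 15
and 20, followed by an idle period and one more request from 30 to 34,
accumulate a busy time of 14 in this model: the idle gap from 20 to 30
is not counted.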
-----------------------------------------------------------------------
maxsize, stripesize and stripeoffset

maxsize is the biggest request we are willing to handle.  If not
set, there is no upper bound on the size of a request and the code
is responsible for chopping it up.  Only hardware methods should
set an upper bound in this field.  Geom_disk will inherit the upper
bound set by the device driver.

stripesize is the width of any natural request boundaries for the
device.  This would be the width of a stripe on a RAID-5 unit or
one zone in GBDE.  The idea with this field is to hint to clustering
type code to not trivially overrun these boundaries.

stripeoffset is the amount of the first stripe which lies before the
device's beginning.

If we have a device with 64k stripes:
	[0...64k[
	[64k...128k[
	[128k...192k[
Then it will have stripesize = 64k and stripeoffset = 0.

If we put an MBR on this device, where slice#1 starts on sector#63,
then this slice will have: stripesize = 64k, stripeoffset = 63 * sectorsize.

If the clustering code wants to widen a request which writes to
sector#53 of the slice, it can calculate how many bytes remain until the
end of the stripe as:
	stripesize - (53 * sectorsize + stripeoffset) % stripesize.
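
As a worked example of the calculation above, here is a minimal sketch
in C.  The helper name bytes_to_stripe_end() is hypothetical, and a
sectorsize of 512 bytes is assumed in the numbers that follow; neither
comes from the GEOM sources.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes from byte offset "offset" (relative to the start of the
 * provider) up to the end of the stripe that contains it.
 */
static uint64_t
bytes_to_stripe_end(uint64_t offset, uint64_t stripesize,
    uint64_t stripeoffset)
{
	assert(stripesize > 0);		/* avoid a modulo by zero */
	return (stripesize - (offset + stripeoffset) % stripesize);
}
```

With sectorsize = 512, stripesize = 64k and stripeoffset = 63 * 512,
sector#53 of the slice gives 65536 - (27136 + 32256) % 65536 = 6144
bytes, i.e. 12 sectors.  This matches the layout: sector#53 of the slice
is sector#116 of the device, and the first stripe ends at sector#128.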
-----------------------------------------------------------------------

#include file usage:

                |geom.h|geom_int.h|geom_ext.h|geom_ctl.h|libgeom.h|
----------------+------+----------+----------+----------+---------+
geom class      |      |          |          |          |         |
implementation  |   X  |          |          |          |         |
----------------+------+----------+----------+----------+---------+
geom kernel     |      |          |          |          |         |
infrastructure  |   X  |    X     |    X     |    X     |         |
----------------+------+----------+----------+----------+---------+
libgeom         |      |          |          |          |         |
implementation  |      |          |    X     |    X     |    X    |
----------------+------+----------+----------+----------+---------+
geom aware      |      |          |          |          |         |
application     |      |          |          |    X     |    X    |
----------------+------+----------+----------+----------+---------+

geom_slice.h is special in that it documents a "library" for implementing
a specific kind of class, and consequently does not appear in the above
matrix.
-----------------------------------------------------------------------
Removable media.

In general, the theory is that a drive creates the provider when it has
media and destroys it when the media disappears.

In a more realistic world, we will allow a provider to be opened medialess
(set any sectorsize and a mediasize==0) in order to allow operations like
open/close tray etc.