<HTML>
<HEAD><TITLE>APR Canonical Filenames</TITLE></HEAD>
<BODY>
<h1>APR Canonical Filename</h1>

<h2>Requirements</h2>

<p>APR porters need to address the underlying discrepancies between
file systems.  To achieve a reasonable degree of security, the
program depending upon APR needs to know that two paths may be
compared, and that a mismatch is guaranteed to reflect that the
two paths do not return the same resource.</p>

<p>The first discrepancy is in volume roots.  Unix and pure derivatives
have only one root path, "/".  Win32 and OS2 share root paths of
the form "D:/", where D: is the volume designation.  However, this can
be specified as "//./D:/" as well, indicating the D: volume of
'this' machine.  Win32 and OS2 also may employ a UNC root path,
of the form "//server/share/" where share is a share-point of the
specified network server.  Finally, NetWare root paths are of the
form "server/volume:/", or the simpler "volume:/" syntax for 'this'
machine.  All these non-Unix file systems accept volume:path,
without a slash following the colon, as a path relative to the
current working directory, which APR will treat as ambiguous, that
is, neither an absolute nor a relative path per se.</p>
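
<p>The root forms above can be told apart with a few patterns.  The
following is a minimal sketch in Python, not APR code: the function name
<code>classify</code>, the pattern table and the "invalid" result for an
unfinished UNC root are all assumptions of this example.  It tries every
form at once, whereas a real port would only accept the forms of its own
platform.</p>

```python
import re

# Root forms from the paragraph above, tried in order. The patterns are this
# sketch's own approximation and are not taken from APR.
ROOT_FORMS = [
    ("local-volume", re.compile(r"//\./[^/:]+:/")),   # //./D:/
    ("unc", re.compile(r"//[^/]+/[^/]+/")),           # //server/share/
    ("netware", re.compile(r"[^/:]+/[^/:]+:/")),      # server/volume:/
    ("volume", re.compile(r"[^/:]+:/")),              # D:/ or volume:/
]
# volume:path with no slash after the colon (checked after the root forms)
AMBIGUOUS = re.compile(r"([^/:]+/)?[^/:]+:")


def classify(path):
    """Return (kind, root); kind is absolute, ambiguous, relative or invalid."""
    p = path.replace("\\", "/")  # treat both separators alike in this sketch
    for _name, pattern in ROOT_FORMS:
        m = pattern.match(p)
        if m:
            return ("absolute", m.group(0))
    if p.startswith("//"):
        return ("invalid", "")   # a UNC root was started but never completed
    if p.startswith("/"):
        return ("absolute", "/")
    if AMBIGUOUS.match(p):
        return ("ambiguous", "")
    return ("relative", "")
```

<p>With this sketch, <code>classify("D:/tmp")</code> reports an absolute
path rooted at "D:/", while <code>classify("D:tmp")</code> reports the
ambiguous volume:path form.</p>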

<p>The second discrepancy is in the meaning of the 'this' directory.
In general, 'this' must be eliminated from the path where it occurs.
The syntax "path/./" and "path/" are both aliases to path, and an empty
segment, as in "path//", is normally discarded in the same way.  However,
this isn't file system independent, since the double slash "//" has
a special meaning on OS2 and Win32 at the start of the path name,
and is invalid on those platforms before the "//server/share/" UNC
root path is completed.  Finally, as noted above, "//./volume/" is
legal root syntax on WinNT, and perhaps others.</p>

<p>The third discrepancy is in the context of the 'parent' directory.
When "parent/path/.." occurs, the path must be unwound to "parent".
It's also critical to simply truncate leading "/../" paths to "/",
since the parent of the root is root.  This gets tricky on the
Win32 and OS2 platforms, since the ".." element is invalid before
the "//server/share/" is complete, and the "//server/share/../"
sequence is the complete UNC root "//server/share/".  In relative
paths, leading ".." elements are significant, until they are merged
with an absolute path.  The relative form must only retain the ".."
segments as leading segments, to be resolved once merged to another
relative or an absolute path.</p>

<p>The fourth discrepancy occurs with acceptance of alternate character
codes for the same element.  Path separators are not retained within
the APR canonical forms.  The OS filesystem and APR (slashed) forms
can both be returned as strings, to be used in the proper context.
Win32 and Netware accept slashes and backslashes as the same path
separator symbol, whereas Unix strictly accepts slashes.
While the APR form of the name strictly uses slashes, always consider
that there could be a platform that actually accepts slashes as a
character within a segment name.</p>

<p>The fifth and worst discrepancy plagues Win32, OS2, Netware, and some
filesystems mounted in Unix.  Case insensitivity can permit the same
file to slip through in both its proper case and alternate cases.
Simply changing the case is insufficient for any character set beyond
ASCII, since various dialectal forms of characters suffer from one-to-many
or many-to-one translations.  An example would be u-umlaut, which
might be accepted as a single character u-umlaut, a two character
sequence u and the zero-width umlaut, the upper case form of the same,
or perhaps even a capital U alone.  This can be handled in different
ways depending on the purposes of the APR based program, but the one
requirement is that the path must be absolute in order to resolve these
ambiguities.  Methods employed include comparison of device and inode
file identifiers, which is a fairly fast operation, or querying the OS
for the true form of the name, which can be much slower.  Only the
acknowledgement of the file names by the OS can validate the equality
of two different cases of the same filename.</p>
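
<p>The u-umlaut example can be reproduced with Python's standard
<code>unicodedata</code> module.  This is only an illustration of why a
string-level case change is not enough; the helper name
<code>naive_case_equal</code> is an assumption of this sketch, and the
sharp s case is an extra one-to-many example that does not come from the
text above.  Nothing here replaces asking the OS.</p>

```python
import unicodedata

composed = "\u00fc"      # u-umlaut as a single character
decomposed = "u\u0308"   # u followed by a zero-width combining umlaut


def naive_case_equal(a, b):
    """Compare two names by upper-casing them, the insufficient approach."""
    return a.upper() == b.upper()


# Both spellings name the same letter, yet the naive comparison rejects them.
naive_result = naive_case_equal(composed, decomposed)

# Normalizing both to one form (NFC here) makes the strings comparable.
normalized_equal = unicodedata.normalize("NFC", decomposed) == composed

# One-to-many translation: upper-casing sharp s yields two characters.
sharp_s_upper = "\u00df".upper()
```

<p>Even after normalization, whether a given file system treats the two
spellings as the same file is something only the OS can confirm.</p>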

<p>The sixth discrepancy, illegal or insignificant characters, is especially
significant in non-Unix file systems.  Trailing periods are accepted
but never stored, therefore trailing periods must be ignored for any
form of comparison.  And all OSes have certain expectations of what
characters are illegal (or undesirable due to confusion).</p>
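
<p>A comparison key that ignores trailing periods might look like the
sketch below.  The function name <code>comparison_key</code> is
hypothetical, and the choice to leave the '.' and '..' segments (and any
segment made only of periods) untouched is an assumption of this example.
It is only meaningful on file systems that drop trailing periods.</p>

```python
def comparison_key(path):
    """Strip trailing periods from each segment of a slash-separated path."""
    cleaned = []
    for segment in path.split("/"):
        if segment not in (".", ".."):
            # Keep the segment unchanged if stripping would leave it empty.
            segment = segment.rstrip(".") or segment
        cleaned.append(segment)
    return "/".join(cleaned)
```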

<p>A final warning: canonical functions don't transform or resolve case
or character ambiguity issues until the path is resolved into an absolute
path.  The relative canonical path, while useful for URL
or similar identifiers, cannot be used for testing or comparison of file
system objects.</p>

<hr>

<h2>Canonical API</h2>

<p>Functions to manipulate the apr_canon_file_t (an opaque type) include:</p>

<ul>
<li>Create canon_file_t (from char* path and canon_file_t parent path)
<li>Merge canon_file_t (from path and parent, both canon_file_t)
<li>Get char* path of all or some segments
<li>Get path flags of IsRelative, IsVirtualRoot, and IsAbsolute
<li>Compare two canon_file_t structures for file equality
</ul>

<p>The path is corrected to the file system case only if it is in absolute
form.  The apr_canon_file_t should be preserved as long as possible and
used as the parent to create child entries to reduce the number of expensive
stat and case canonicalization calls to the OS.</p>

<p>The comparison operation lets APR postpone correction
of case by simply relying upon the device and inode for equivalence.  The
stat implementation can establish that two files are the same even while their
strings are not equivalent, which eliminates the need for the operating
system to return the proper form of the name.</p>
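
<p>The device and inode test can be sketched with Python's
<code>os.stat</code>.  The function name <code>same_file</code> is an
assumption of this example and is not part of any APR interface; both
paths must exist for the stat calls to succeed.</p>

```python
import os


def same_file(path_a, path_b):
    """Return True when both paths resolve to the same device and inode."""
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

<p>The standard library's <code>os.path.samefile</code> performs the same
kind of check; the explicit version is shown only to make the two fields
being compared visible.</p>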

<p>In any case, returning the char* path, with a flag to request the proper
case, forces the OS calls to resolve the true names of each segment.  Where
there is a penalty for this operation and the stat device and inode test
is faster, case correction is postponed until the char* result is requested.
On platforms that identify the inode, device, or proper name interchangeably
with no penalties, this may occur when the name is initially processed.</p>

<hr>

<h2>Unix Example</h2>

<p>First the simplest case:</p>

<pre>
Parse Canonical Name
accepts parent path as canonical_t
        this path as string

Split this path Segments on '/'

For each of this path Segments
  If first Segment
    If this Segment is Empty ([nothing]/)
      Append this Root Segment (don't merge)
      Continue to next Segment
    Else is relative
      Append parent Segments (to merge)
      Continue with this Segment
  If Segment is '.' or empty (2 slashes)
    Discard this Segment
    Continue with next Segment
  If Segment is '..'
    If no previous Segment or previous Segment is '..'
      Append this Segment
      Continue with next Segment
    If previous Segment and previous is not Root Segment
      Discard previous Segment
    Discard this Segment
    Continue with next Segment
  Append this Relative Segment
  Continue with next Segment
</pre>
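
<p>The steps above can be sketched in Python as follows.  This is an
illustration under stated assumptions rather than APR's implementation:
the names <code>parse_canonical</code>, <code>to_string</code> and
<code>ROOT</code> are hypothetical, the canonical form is modelled as a
plain list of segments, and an empty path string is treated as relative
so that it simply returns the parent.</p>

```python
ROOT = "/"  # marker for the root segment; cannot collide with a real segment


def parse_canonical(parent, path):
    """Merge a path string onto a parent given as a list of canonical segments."""
    result = []
    for index, segment in enumerate(path.split("/")):
        if index == 0:
            if segment == "" and path.startswith("/"):
                result.append(ROOT)      # absolute: start at root, don't merge
                continue
            result.extend(parent)        # relative: merge onto the parent
        if segment in ("", "."):
            continue                     # discard 'this' and doubled slashes
        if segment == "..":
            if not result or result[-1] == "..":
                result.append(segment)   # keep leading '..' of a relative path
            elif result[-1] != ROOT:
                result.pop()             # unwind one segment
            continue                     # the parent of root is root
        result.append(segment)
    return result


def to_string(segments):
    """Render a list of canonical segments in the slashed form."""
    if segments[:1] == [ROOT]:
        return "/" + "/".join(segments[1:])
    return "/".join(segments)
```

<p>For example, merging "../docs/./x" onto the parent segments of
"/home/user" unwinds "user" and gives the segments of "/home/docs/x",
while "../../a/../b" with an empty parent keeps both leading ".."
segments.</p>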

</BODY>
</HTML>