rdef grammar
============

This is the (somewhat boring) specification of the rdef file format as it is understood by librdef.
It also describes to a certain extent how the compiler works. You don't need to read this unless
you want to hack librdef. Knowledge of compiler theory and lex/yacc is assumed.

The lexer
---------

Like any compiler, librdef contains a lexer (aka scanner) and a parser. The lexer reads the input
file and chops it up into tokens. The lexer ignores single-line ``//`` comments and ``/* ... */``
multi-line comments. It also ignores whitespace and newlines.

The lexer recognizes the following tokens:

BOOL
    true or false

INTEGER
    You can specify integers as decimal numbers, hexadecimal numbers (starting with 0x or 0X, alpha
    digits are case insensitive), octal numbers (starting with a leading 0), binary numbers
    (starting with 0b or 0B), or as a four character code ('CCCC'). Valid range is 64 bits. At
    this point, numbers are always unsigned. The minus sign is treated as a separate token, and is
    dealt with by the parser.

FLOAT
    A floating point literal. Must contain a decimal point, may contain a signed exponent. Stored
    internally as a double.

STRING
    UTF-8 compatible string literal, enclosed by double quotes. Can contain escape sequences
    (``\b \f \n \r \t \v \" \\ \0``), octal escapes (``\000``) and hex escapes (``\0x00`` or
    ``\x00``). May not span more than one line, although you are allowed to specify multiple
    string literals in a row and the lexer will automatically concatenate them. There is no
    maximum length.

RAW
    Hexadecimal representation of raw data, enclosed by double quotes, and prefixed by a dollar
    sign: $"12FFAB". Each byte is represented by two hex characters, so there must be an even
    number of characters between the quotes. The alpha digits are not case sensitive. Like STRING,
    a RAW token may not span more than one line, but multiple consecutive RAW tokens are
    automatically concatenated. No maximum length.

IDENT
    C/C++ identifier. First character is alphabetic or underscore. Other characters are
    alphanumeric or underscore.

TYPECODE
    A hash sign followed by a 32-bit unsigned decimal number, hex number, or four character code.
    Examples: #200, #0x00C8, #'MIMS'
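
The INTEGER forms listed above can be illustrated with a small Python sketch. It is not taken
from librdef: the function name ``parse_integer_literal`` is hypothetical, and packing a four
character code with the first character in the most significant byte is an assumption based on
how such codes are conventionally read.

.. code-block:: python

    def parse_integer_literal(text):
        """Convert the text of an INTEGER token to an unsigned 64-bit value."""
        if len(text) == 6 and text[0] == "'" and text[-1] == "'":
            # Four character code: assumed to be packed big-endian, so the
            # first character ends up in the most significant byte.
            value = int.from_bytes(text[1:-1].encode("latin-1"), "big")
        elif text[:2] in ("0x", "0X"):
            value = int(text[2:], 16)   # hexadecimal, digits in either case
        elif text[:2] in ("0b", "0B"):
            value = int(text[2:], 2)    # binary
        elif len(text) > 1 and text[0] == "0":
            value = int(text[1:], 8)    # a leading 0 means octal
        else:
            value = int(text, 10)       # plain decimal
        if value > 0xFFFFFFFFFFFFFFFF:
            raise ValueError("integer literal does not fit in 64 bits")
        return value

The result is always non-negative, which mirrors the point that the minus sign is a separate
token handled by the parser.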

The following are treated as keywords and special symbols:

``enum resource array message archive type import { } ( ) , ; = - + * / % ^ | & ~``

The lexer also deals with #include statements, which look like: ``#include "filename"\n``. When you
#include a file, the lexer expects it to contain valid rdef syntax. So even though the include file
is probably a C/C++ header, it should not contain anything but the enum statement and/or comments.
The lexer only looks for include files in the include search paths that you have specified, so if
you want it to look in the current working directory you have to explicitly specify that. You may
nest #includes.

A note about UTF-8. Since the double quote (hex 0x22) is never part of the second or third byte of
a UTF-8 character, the lexer can safely deal with UTF-8 characters inside string literals. That is
also the reason that the decompiler does not escape characters that are not human-readable
(except the ones in the 7-bit ASCII range), because they could be part of a UTF-8 encoding.
The current version of librdef does not handle L"..." (wide char) strings, but nobody uses them anyway.
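
The UTF-8 argument can be checked with a short Python sketch. The helper
``find_closing_quote`` is hypothetical and is not how the real lexer is written; it only shows
that a byte-wise scan for the closing quote is safe, because every byte of a multi-byte UTF-8
sequence is 0x80 or higher and so can never equal 0x22.

.. code-block:: python

    QUOTE = 0x22       # the double quote character
    BACKSLASH = 0x5C   # starts an escape sequence

    def find_closing_quote(data, start):
        """Return the index of the first unescaped double quote in the
        bytes object data at or after start, or -1 if there is none."""
        i = start
        while i < len(data):
            if data[i] == BACKSLASH:
                i += 2              # skip the backslash and the escaped byte
                continue
            if data[i] == QUOTE:
                return i
            i += 1
        return -1

    # Multi-byte characters never contain a byte below 0x80.
    sample = "\u65e5\u672c\u8a9e\u00e9".encode("utf-8")
    assert all(byte >= 0x80 for byte in sample)

The scan never needs to decode the text, which is why the lexer can treat string contents as
plain bytes.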

The parser
----------

The parser takes the tokens from the lexer and matches them against the rules of the grammar. What
follows is the grammar in a simplified variation of BNF, so the actual bison source file may look
a little different. Legend:

+-------------+-------------------------------------+
| `[ a ]`     | match a 0 or 1 times                |
+-------------+-------------------------------------+
| `{ b }`     | match b 0 or more times             |
+-------------+-------------------------------------+
| `c | d`     | match either c or d                 |
+-------------+-------------------------------------+
| `( e f )`   | group e and f together              |
+-------------+-------------------------------------+
| lowercase   | nonterminal                         |
+-------------+-------------------------------------+
| UPPER       | token from the lexer                |
+-------------+-------------------------------------+
| `'c'`       | literal token from the lexer        |
+-------------+-------------------------------------+

The rdef grammar consists of the following rules:

script
    {enum | typedef | resource}

enum
    ENUM '{' [symboldef {',' symboldef} [',']] '}' ';'

symboldef
    IDENT ['=' integer]

typedef
    TYPE [id] [TYPECODE] IDENT '{' fielddef {',' fielddef} '}' ';'

fielddef
    datatype IDENT ['[' INTEGER ']'] ['=' expr]

resource
    RESOURCE [id] [typecode] expr ';'

id
    '(' [(integer | IDENT) [',' STRING] | STRING] ')'

typecode
    ['('] TYPECODE [')']

expr
    expr BINARY_OPER expr | UNARY_OPER expr | data

data
    [typecast] (BOOL | integer | float | STRING | RAW | array | message | archive | type | define | '(' expr ')' )

typecast
    ['(' datatype ')']

datatype
    ARRAY | MESSAGE | ARCHIVE IDENT | IDENT

integer
    ['-'] INTEGER

float
    ['-'] FLOAT

array
    ARRAY ['{' [expr {',' expr}] '}'] | [ARRAY] IMPORT STRING

message
    MESSAGE ['(' integer ')'] ['{' [msgfield {',' msgfield}] '}']

msgfield
    [TYPECODE] [datatype] STRING '=' expr

archive
    ARCHIVE [archiveid] IDENT '{' msgfield {',' msgfield} '}'

archiveid
    '(' [STRING] [',' integer] ')'

type
    IDENT [data | '{' [typefield {',' typefield}] '}']

typefield
    [IDENT '='] expr

define
    IDENT

Semantics
---------

Resource names
##############

There are several different ways to specify the ID and name of a new resource:

``resource``
    The resource is assigned the default name and ID of its data type.

``resource()``
    The resource is assigned the default name and ID of its data type.

``resource(1)``
    The resource is assigned the numeric ID 1, and the default name of its data type.

``resource("xxx")``
    The resource is assigned the name "xxx" and the default ID of its data type.

``resource(1, "xxx")``
    The resource is assigned the numeric ID 1, and the name "xxx".

``resource(sss)``
    The resource is assigned the numeric ID that corresponds with the symbol sss, which should have
    been defined in an enum earlier. If the "auto names" option is passed to the compiler, the
    resource is also given the name "sss", otherwise the default name from its data type is used.

``resource(sss, "xxx")``
    The resource is assigned the numeric ID that corresponds with the symbol sss, and the name "xxx".
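
The rules above can be summarized in a short Python sketch. All names here
(``resolve_resource``, ``enum_symbols``, ``auto_names`` and so on) are hypothetical and only
model the behavior described in this section; they are not librdef's API.

.. code-block:: python

    def resolve_resource(id_spec, name, enum_symbols, default_id, default_name,
                         auto_names=False):
        """Return the (id, name) pair for a resource.

        id_spec is None (no ID given), an int, or a str naming an enum symbol.
        name is None (no name given) or a str.
        default_id and default_name come from the resource's data type.
        """
        if id_spec is None:
            res_id = default_id
        elif isinstance(id_spec, int):
            res_id = id_spec
        else:
            # A symbol: it must have been defined in an enum earlier,
            # otherwise this lookup raises KeyError.
            res_id = enum_symbols[id_spec]

        if name is not None:
            res_name = name
        elif isinstance(id_spec, str) and auto_names:
            res_name = id_spec      # "auto names": reuse the symbol as name
        else:
            res_name = default_name
        return res_id, res_name

For instance, ``resource(sss)`` corresponds to ``resolve_resource("sss", None, ...)``, and
``resource(1, "xxx")`` to ``resolve_resource(1, "xxx", ...)``.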

Data types and type casts
#########################

Resources (and message fields) have a type code and a data type. The data type determines the
format the data is stored in, while the type code tells the user how to interpret the data.
Typically, there is some kind of relation between the two, otherwise the resource will be a little
hard to read.

The following table lists the compiler's built-in data types. (Users can also define their own
types; this is described in a later section.)

+---------+----------------+
| bool    | B_BOOL_TYPE    |
+---------+----------------+
| int8    | B_INT8_TYPE    |
+---------+----------------+
| uint8   | B_UINT8_TYPE   |
+---------+----------------+
| int16   | B_INT16_TYPE   |
+---------+----------------+
| uint16  | B_UINT16_TYPE  |
+---------+----------------+
| int32   | B_INT32_TYPE   |
+---------+----------------+
| uint32  | B_UINT32_TYPE  |
+---------+----------------+
| int64   | B_INT64_TYPE   |
+---------+----------------+
| uint64  | B_UINT64_TYPE  |
+---------+----------------+
| size_t  | B_SIZE_T_TYPE  |
+---------+----------------+
| ssize_t | B_SSIZE_T_TYPE |
+---------+----------------+
| off_t   | B_OFF_T_TYPE   |
+---------+----------------+
| time_t  | B_TIME_TYPE    |
+---------+----------------+
| float   | B_FLOAT_TYPE   |
+---------+----------------+
| double  | B_DOUBLE_TYPE  |
+---------+----------------+
| string  | B_STRING_TYPE  |
+---------+----------------+
| raw     | B_RAW_TYPE     |
+---------+----------------+
| array   | B_RAW_TYPE     |
+---------+----------------+
| buffer  | B_RAW_TYPE     |
+---------+----------------+
| message | B_MESSAGE_TYPE |
+---------+----------------+
| archive | B_MESSAGE_TYPE |
+---------+----------------+

The type code has no effect on how the data is stored. For example, if you do this:
"resource(x) #'LONG' true", then the data will not automatically be stored as a 32-bit number!
If you don't specify an explicit type code, the compiler uses the type of the data for that.

You can change the data type with a type cast. The following casts are allowed:

+--------------------+--------------------------------------------------------------------------+
| bool               | You cannot cast bool data.                                               |
+--------------------+--------------------------------------------------------------------------+
| integer            | You can cast to all numeric data types. Casts to smaller datatypes will  |
|                    | truncate the number. Casting negative numbers to unsigned datatypes (and |
|                    | vice versa) will wrap them, i.e. (uint8) -1 becomes 255.                 |
+--------------------+--------------------------------------------------------------------------+
| floating point     | You can only cast to float or double.                                    |
+--------------------+--------------------------------------------------------------------------+
| string             | You cannot cast string data.                                             |
+--------------------+--------------------------------------------------------------------------+
| raw, buffer, array | You can cast anything to raw, but not the other way around.              |
+--------------------+--------------------------------------------------------------------------+
| message, archive   | You cannot cast message data.                                            |
+--------------------+--------------------------------------------------------------------------+
| type               | You cannot cast user-defined types.                                      |
+--------------------+--------------------------------------------------------------------------+
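
The truncating and wrapping behavior of integer casts is ordinary two's complement arithmetic.
The following Python sketch models it; ``cast_integer`` is a hypothetical helper, not a
function from librdef.

.. code-block:: python

    def cast_integer(value, bits, signed):
        """Truncate value to the given number of bits and reinterpret the
        result as a signed or unsigned integer."""
        value &= (1 << bits) - 1            # keep only the low bits
        if signed and value >= 1 << (bits - 1):
            value -= 1 << bits              # top bit set: value is negative
        return value

    # The example from the table: (uint8) -1 becomes 255.
    assert cast_integer(-1, 8, signed=False) == 255

The same helper covers the "vice versa" case: ``cast_integer(255, 8, signed=True)`` wraps
around to a negative number.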

In addition to the "simple" built-in data types, the compiler also natively supports several data
structures from the BeOS API (point, rect, rgb_color) and a few convenience types (app_signature,
app_flags, etc). These types all follow the same rules as user-defined types.

Arrays
######

The following definitions are semantically equivalent:

.. code-block:: c

    resource(x) $"AABB";
    resource(x) array { $"AA" $"BB" };
    resource(x) array { $"AA", $"BB" };

The comma is optional and simply concatenates the two literals. When you decompile this code,
it always looks like:

.. code-block:: c

    resource(x) $"AABB";

Strings behave differently. The following two definitions are equivalent, and concatenate the two
literals into one string:

.. code-block::

    resource(x) "AA" "BB";
    resource(x) #'CSTR' array { "AA" "BB" };

However, if you put a comma between the strings, the compiler will still glue them together
but with a ``'\0'`` character in the middle. Now the resource contains *two* strings: "AA" and "BB".
You can also specify the ``'\0'`` character yourself:

.. code-block::

    resource(x) "AA\0BB";
    resource(x) #'CSTR' array { "AA", "BB" };

The following is not proper grammar; use an array instead:

.. code-block:: c

    resource(x) "AA", "BB";
    resource(x) $"AA", $"BB";

Note that the data type of an array is always raw data, no matter how you specify its contents.
Because raw literals may be empty ($""), so may arrays.
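
The difference between raw and string items in an array can be modeled with two small Python
helpers. The names are hypothetical, and the sketch only covers the separator behavior
described in this section; whether a terminating NUL byte is stored after the last string is
not addressed here.

.. code-block:: python

    def join_raw_items(items):
        """items is a list of comma-separated entries; each entry is a list
        of adjacent RAW literals (as bytes). Commas change nothing."""
        return b"".join(b"".join(literals) for literals in items)

    def join_string_items(items):
        """Same layout as above, but for string literals: adjacent literals
        are concatenated, comma-separated entries are joined with a NUL."""
        return b"\0".join(b"".join(literals) for literals in items)

    # array { $"AA" $"BB" } and array { $"AA", $"BB" } give the same bytes.
    assert join_raw_items([[b"\xAA", b"\xBB"]]) == join_raw_items([[b"\xAA"], [b"\xBB"]])

With strings, ``join_string_items([[b"AA", b"BB"]])`` models ``array { "AA" "BB" }``, while
``join_string_items([[b"AA"], [b"BB"]])`` models ``array { "AA", "BB" }``.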

Messages and archives
#####################

A message resource is a flattened BMessage. By default it has the data type B_MESSAGE_TYPE and
corresponding type code #'MSGG'. If you don't specify a "what" code for the message, it defaults to 0.

Message fields assume the type of their data, unless you specify a different type in front of the
field name. (Normal casting rules apply here.) You can also give the field a different type code,
which tells the BMessage how to interpret the data, but not how it is stored in the message.
This type code also goes in front of the field name. You can give the same name to multiple fields,
provided that they all have the same type. (The data of these fields does not have to be the same
size.) A message may be empty; it is still a valid BMessage, but it contains no fields.

An archive is also a flattened BMessage, but one that was made by Archive()'ing a BArchivable class,
such as BBitmap. The name of the archive, in this case BBitmap, is automatically added to the
message in a field called "class". The "archive" keyword is optionally followed by a set of
parentheses that enclose a string and/or an integer. The int is the "what" code, the string is
stored in a field called "add_on" (used for dynamic loading of BArchivables). Other than that,
archives and messages are identical. The compiler does not check whether the contents of the
archive actually make sense, so if you don't structure the data properly you may be unable to
unarchive the object later. Unlike a message, an archive may not be empty, because that is pointless.

User-defined types
##################

We allow users to define their own types. A "type" is just a fancy array, because the data from the
various fields is simply concatenated into one big block of bytes. The difference is that
user-defined types are much easier to fill in.

A user-defined type has a symbolic name, a type code, and a number of data fields. After all the
fields have been concatenated, the type code is applied to the whole block. So, the data type of
this resource is always the same as its type code (unlike arrays, which are always raw data).
If no type code is specified, it defaults to B_RAW_TYPE.

The data fields always have a default value. For simple fields this is typically 0 (numeric types)
or empty (string, raw, message). The default value of a user-defined type as a whole is the
combination of the default values of its fields. Of course, the user can specify other defaults.
(When users create a new resource that uses such a type, they are basically overriding the default
values with their own.)

The user may fill in the data fields by name, by order, or using a combination of both. Every time
the compiler sees an unnamed data item, it stuffs it into the next available field. Named data
items are simply assigned to the field with the same name, and may overwrite a value that was
previously put there "by order". Any fields that are not filled in keep their default value. For
example:

.. code-block:: c

    type vector { int32 x, int32 y, int32 z, int32 w = 4 };
    resource(1) vector { 1, 3, x = 2 };

Here, x is first set to 1, y is set to 3, x is now overwritten by the value 2, z is given the
default value 0, and w defaults to 4.
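
The fill-in rules can be sketched in Python as follows. ``fill_fields`` is a hypothetical
helper, and it makes one assumption that the text does not spell out: a named item does not
advance the position used for the next unnamed item.

.. code-block:: python

    def fill_fields(fields, items):
        """Return the final field values in declaration order.

        fields: list of (name, default) pairs in declaration order.
        items:  list of (name_or_None, value) pairs in script order;
                None marks an unnamed item that is assigned by order.
        """
        names = [name for name, _ in fields]
        values = dict(fields)               # start from the defaults
        next_index = 0                      # next field to fill "by order"
        for name, value in items:
            if name is None:
                if next_index >= len(names):
                    raise ValueError("too many unnamed items")
                name = names[next_index]
                next_index += 1
            elif name not in values:
                raise KeyError("unknown field: " + name)
            values[name] = value            # may overwrite an earlier value
        return [values[name] for name in names]

    # The vector example: { 1, 3, x = 2 } with defaults x=0, y=0, z=0, w=4.
    vector = [("x", 0), ("y", 0), ("z", 0), ("w", 4)]
    assert fill_fields(vector, [(None, 1), (None, 3), ("x", 2)]) == [2, 3, 0, 4]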

Note: if a user-defined type contains string, raw, or message fields, the size of the type depends
on the data that the user puts into it, because these fields have a variable size. However, the
user may specify a fixed size for a field (number of bytes, enclosed in square brackets following
the field name). In this case, data that is too long will be truncated and data that is too short
will be padded with zeroes. You can do this for all types, but it really only makes sense for
strings and raw data. More about this in the manual.
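
The effect of a fixed field size amounts to a truncate-or-pad step on the field's bytes. A
minimal Python sketch, with a hypothetical helper name:

.. code-block:: python

    def fit_to_size(data, size):
        """Truncate or zero-pad data (a bytes object) to exactly size bytes."""
        return data[:size].ljust(size, b"\0")

    assert fit_to_size(b"hello", 3) == b"hel"        # too long: truncated
    assert fit_to_size(b"hi", 4) == b"hi\0\0"        # too short: zero-padded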

A type definition may also contain a default resource ID and name. The default ID of built-in types
is usually 1 and the name is empty (NULL). For example:

.. code-block:: c

    type(10, "MyName") mytype { int32 a };
    resource mytype 123;

The resource is now called "MyName" and has ID 10. Obviously you can only do this once or you will
receive a duplicate resource error. If this type is used inside an array or other compound type,
the default ID and resource name are ignored. Note: this feature introduces a shift/reduce conflict
in the compiler:

.. code-block:: c

    resource (int8) 123;

This probably doesn't do what you expect. The compiler now considers the "(int8)" to be the
resource ID, not a typecast. If you did not declare "int8" in an enum (probably not), this gives a
compiler error. Not a big problem, because it is unlikely that you will ever do this. Here is a
workaround:

.. code-block:: c

    resource() (int8) 123;

The grammar and Bison
#####################

Above I mentioned one of the shift/reduce conflicts from the grammar. There are several others.
These are mostly the result of keeping the original grammar intact as much as possible, without
having to introduce weird syntax rules for the new features. These issues aren't fatal but if you
try to do something funky in your script, you may get an error message.

The main culprit here is the "( expr )" rule from "data", which allows you to nest expressions with
parens, e.g. "`(10 + 5) * 3`". This causes problems for Bison, because we already use parens all
over the place. Specifically, this rule conflicts with the empty "MESSAGE" from the "message" rule,
"ARRAY" from "array", and "IDENT" from "type". These rules have no other symbols following them,
which makes them ambiguous with respect to the "datatype" rules. Still with me? The parser will
typically pick the right one, though.

The nested expressions rule also caused a reduce/reduce conflict. To get rid of that, I had to
explicitly mention the names of the various types in the "typecast" rule, which introduces a little
code duplication but it's not too bad. Just so you know, the original rule was simply:

.. code-block::

    typecast
        : '(' datatype ')' { $$ = $2; }
        ;

The new rule is a little more bulky:

.. code-block::

    typecast
        : '(' ARRAY ')'   { ... }
        | '(' MESSAGE ')' { ... }
        ... and so on for all the datatypes ...
        ;

The unary minus operator is not part of the "expr" (or "data") rules, but of "integer" and "float".
This also causes a shift/reduce warning.

And finally, "type" is a member of "data" which is called by "expr". One of the rules of "type"
refers back to "expr". This introduces a recursive dependency and a whole bunch of shift/reduce
conflicts. Fortunately, it seems to have no bad side effects. Yay!

Symbol table
############

The compiler uses two symbol tables: one for the enum symbols, and one with the data type
definitions. We need those tables to find the numeric ID/type definition that corresponds with an
identifier, and to make sure that there are no duplicate or missing identifiers. These two symbol
tables are independent, so you may use the same identifier both as an enum symbol and a type name.

The compiler does not need to keep a symbol table for the resources. Although the combination of a
resource's ID and its type code must be unique, we can use the BResources class to check for this
when we add a resource. There is no point in duplicating this functionality in the compiler.
(However, if we are merging the new resources into an existing resource file, we will simply
overwrite duplicates.)

Misc remarks
############

As the grammar shows, the last field in an enum statement may be followed by a comma. This is
valid C/C++ syntax, and since the enum will typically end up in a header file, we support that
feature as well. For anything else between braces, the last item may not be followed by a comma.

The type code that follows the "resource" keyword may be enclosed by parens for historical reasons.
The preferred notation is without, just like in front of field names (where no parens are allowed).

Even though "ARCHIVE IDENT" is a valid data type, we simply ignore the identifier for now. Maybe
later we will support casting between different archives or whatever. For now, casting to an
archive is the same as casting to a message, since an archive is really a message.

User-defined types and defines have their own symbol tables. Due to the way the parser is
structured, we can only distinguish between types and defines by matching the identifier against
both symbol tables. Here types have priority, so you could 'shadow' a define with a type name if
you were really evil. Maybe it makes sense for rc to use one symbol table for all things in the
future, especially since we're using yet another table for enums. We'll see.
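
The lookup order for a bare identifier can be pictured with a tiny Python sketch. The function
and table names are hypothetical, and plain dictionaries stand in for whatever structures the
compiler really uses.

.. code-block:: python

    def classify_identifier(ident, type_table, define_table):
        """Decide whether ident refers to a user-defined type or a define.
        Types are checked first, so a type name shadows a define."""
        if ident in type_table:
            return "type"
        if ident in define_table:
            return "define"
        raise KeyError("unknown identifier: " + ident)

    types = {"vector": object()}
    defines = {"vector": 1, "answer": 42}
    assert classify_identifier("vector", types, defines) == "type"
    assert classify_identifier("answer", types, defines) == "define"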