1
2The .xz File Format
3===================
4
5Version 1.0.4 (2009-08-27)
6
7
8        0. Preface
9           0.1. Notices and Acknowledgements
10           0.2. Getting the Latest Version
11           0.3. Version History
12        1. Conventions
13           1.1. Byte and Its Representation
14           1.2. Multibyte Integers
15        2. Overall Structure of .xz File
16           2.1. Stream
17                2.1.1. Stream Header
18                       2.1.1.1. Header Magic Bytes
19                       2.1.1.2. Stream Flags
20                       2.1.1.3. CRC32
21                2.1.2. Stream Footer
22                       2.1.2.1. CRC32
23                       2.1.2.2. Backward Size
24                       2.1.2.3. Stream Flags
25                       2.1.2.4. Footer Magic Bytes
26           2.2. Stream Padding
27        3. Block
28           3.1. Block Header
29                3.1.1. Block Header Size
30                3.1.2. Block Flags
31                3.1.3. Compressed Size
32                3.1.4. Uncompressed Size
33                3.1.5. List of Filter Flags
34                3.1.6. Header Padding
35                3.1.7. CRC32
36           3.2. Compressed Data
37           3.3. Block Padding
38           3.4. Check
39        4. Index
40           4.1. Index Indicator
41           4.2. Number of Records
42           4.3. List of Records
43                4.3.1. Unpadded Size
44                4.3.2. Uncompressed Size
45           4.4. Index Padding
46           4.5. CRC32
47        5. Filter Chains
48           5.1. Alignment
49           5.2. Security
50           5.3. Filters
51                5.3.1. LZMA2
52                5.3.2. Branch/Call/Jump Filters for Executables
53                5.3.3. Delta
54                       5.3.3.1. Format of the Encoded Output
55           5.4. Custom Filter IDs
56                5.4.1. Reserved Custom Filter ID Ranges
57        6. Cyclic Redundancy Checks
58        7. References
59
60
610. Preface
62
63        This document describes the .xz file format (filename suffix
64        ".xz", MIME type "application/x-xz"). It is intended that this
65        this format replace the old .lzma format used by LZMA SDK and
66        LZMA Utils.
67
68
690.1. Notices and Acknowledgements
70
71        This file format was designed by Lasse Collin
72        <lasse.collin@tukaani.org> and Igor Pavlov.
73
74        Special thanks for helping with this document goes to
75        Ville Koskinen. Thanks for helping with this document goes to
76        Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
77
78        This document has been put into the public domain.
79
80
810.2. Getting the Latest Version
82
83        The latest official version of this document can be downloaded
84        from <http://tukaani.org/xz/xz-file-format.txt>.
85
86        Specific versions of this document have a filename
87        xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
88        For example, the version 1.0.0 of this document is available
89        at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
90
91
920.3. Version History
93
94        Version   Date          Description
95
96        1.0.4     2009-08-27    Language improvements in Sections 1.2,
97                                2.1.1.2, 3.1.1, 3.1.2, and 5.3.1
98
99        1.0.3     2009-06-05    Spelling fixes in Sections 5.1 and 5.4
100
101        1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1
102
103        1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
104                                clarifications to Sections 2, 2.2,
105                                3.3, 4.4, and 5.3.2
106
107        1.0.0     2009-01-14    The first official version
108
109
1101. Conventions
111
112        The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
113        "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
114        document are to be interpreted as described in [RFC-2119].
115
116        Indicating a warning means displaying a message, returning
117        appropriate exit status, or doing something else to let the
118        user know that something worth warning occurred. The operation
119        SHOULD still finish if a warning is indicated.
120
121        Indicating an error means displaying a message, returning
122        appropriate exit status, or doing something else to let the
123        user know that something prevented successfully finishing the
124        operation. The operation MUST be aborted once an error has
125        been indicated.
126
127
1281.1. Byte and Its Representation
129
130        In this document, byte is always 8 bits.
131
132        A "null byte" has all bits unset. That is, the value of a null
133        byte is 0x00.
134
135        To represent byte blocks, this document uses notation that
136        is similar to the notation used in [RFC-1952]:
137
138            +-------+
139            |  Foo  |   One byte.
140            +-------+
141
142            +---+---+
143            |  Foo  |   Two bytes; that is, some of the vertical bars
144            +---+---+   can be missing.
145
146            +=======+
147            |  Foo  |   Zero or more bytes.
148            +=======+
149
150        In this document, a boxed byte or a byte sequence declared
151        using this notation is called "a field". The example field
152        above would be called "the Foo field" or plain "Foo".
153
154        If there are many fields, they may be split to multiple lines.
155        This is indicated with an arrow ("--->"):
156
157            +=====+
158            | Foo |
159            +=====+
160
161                 +=====+
162            ---> | Bar |
163                 +=====+
164
165        The above is equivalent to this:
166
167            +=====+=====+
168            | Foo | Bar |
169            +=====+=====+
170
171
1721.2. Multibyte Integers
173
174        Multibyte integers of static length, such as CRC values,
175        are stored in little endian byte order (least significant
176        byte first).
177
178        When smaller values are more likely than bigger values (for
179        example file sizes), multibyte integers are encoded in a
180        variable-length representation:
181          - Numbers in the range [0, 127] are copied as is, and take
182            one byte of space.
183          - Bigger numbers will occupy two or more bytes. All but the
184            last byte of the multibyte representation have the highest
185            (eighth) bit set.
186
187        For now, the value of the variable-length integers is limited
188        to 63 bits, which limits the encoded size of the integer to
189        nine bytes. These limits may be increased in the future if
190        needed.
191
192        The following C code illustrates encoding and decoding of
193        variable-length integers. The functions return the number of
194        bytes occupied by the integer (1-9), or zero on error.
195
196            #include <stddef.h>
197            #include <inttypes.h>
198
199            size_t
200            encode(uint8_t buf[static 9], uint64_t num)
201            {
202                if (num > UINT64_MAX / 2)
203                    return 0;
204
205                size_t i = 0;
206
207                while (num >= 0x80) {
208                    buf[i++] = (uint8_t)(num) | 0x80;
209                    num >>= 7;
210                }
211
212                buf[i++] = (uint8_t)(num);
213
214                return i;
215            }
216
217            size_t
218            decode(const uint8_t buf[], size_t size_max, uint64_t *num)
219            {
220                if (size_max == 0)
221                    return 0;
222
223                if (size_max > 9)
224                    size_max = 9;
225
226                *num = buf[0] & 0x7F;
227                size_t i = 0;
228
229                while (buf[i++] & 0x80) {
230                    if (i >= size_max || buf[i] == 0x00)
231                        return 0;
232
233                    *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
234                }
235
236                return i;
237            }
238
239
2402. Overall Structure of .xz File
241
242        A standalone .xz files consist of one or more Streams which may
243        have Stream Padding between or after them:
244
245            +========+================+========+================+
246            | Stream | Stream Padding | Stream | Stream Padding | ...
247            +========+================+========+================+
248
249        The sizes of Stream and Stream Padding are always multiples
250        of four bytes, thus the size of every valid .xz file MUST be
251        a multiple of four bytes.
252
253        While a typical file contains only one Stream and no Stream
254        Padding, a decoder handling standalone .xz files SHOULD support
255        files that have more than one Stream or Stream Padding.
256
257        In contrast to standalone .xz files, when the .xz file format
258        is used as an internal part of some other file format or
259        communication protocol, it usually is expected that the decoder
260        stops after the first Stream, and doesn't look for Stream
261        Padding or possibly other Streams.
262
263
2642.1. Stream
265
266        +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
267        |     Stream Header     | Block | Block | ... | Block |
268        +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
269
270             +=======+-+-+-+-+-+-+-+-+-+-+-+-+
271        ---> | Index |     Stream Footer     |
272             +=======+-+-+-+-+-+-+-+-+-+-+-+-+
273
274        All the above fields have a size that is a multiple of four. If
275        Stream is used as an internal part of another file format, it
276        is RECOMMENDED to make the Stream start at an offset that is
277        a multiple of four bytes.
278
279        Stream Header, Index, and Stream Footer are always present in
280        a Stream. The maximum size of the Index field is 16 GiB (2^34).
281
282        There are zero or more Blocks. The maximum number of Blocks is
283        limited only by the maximum size of the Index field.
284
285        Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
286        The same limit applies to the total amount of uncompressed
287        data stored in a Stream.
288
289        If an implementation supports handling .xz files with multiple
290        concatenated Streams, it MAY apply the above limits to the file
291        as a whole instead of limiting per Stream basis.
292
293
2942.1.1. Stream Header
295
296        +---+---+---+---+---+---+-------+------+--+--+--+--+
297        |  Header Magic Bytes   | Stream Flags |   CRC32   |
298        +---+---+---+---+---+---+-------+------+--+--+--+--+
299
300
3012.1.1.1. Header Magic Bytes
302
303        The first six (6) bytes of the Stream are so called Header
304        Magic Bytes. They can be used to identify the file type.
305
306            Using a C array and ASCII:
307            const uint8_t HEADER_MAGIC[6]
308                    = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
309
310            In plain hexadecimal:
311            FD 37 7A 58 5A 00
312
313        Notes:
314          - The first byte (0xFD) was chosen so that the files cannot
315            be erroneously detected as being in .lzma format, in which
316            the first byte is in the range [0x00, 0xE0].
317          - The sixth byte (0x00) was chosen to prevent applications
318            from misdetecting the file as a text file.
319
320        If the Header Magic Bytes don't match, the decoder MUST
321        indicate an error.
322
323
3242.1.1.2. Stream Flags
325
326        The first byte of Stream Flags is always a null byte. In the
327        future, this byte may be used to indicate a new Stream version
328        or other Stream properties.
329
330        The second byte of Stream Flags is a bit field:
331
332            Bit(s)  Mask  Description
333             0-3    0x0F  Type of Check (see Section 3.4):
334                              ID    Size      Check name
335                              0x00   0 bytes  None
336                              0x01   4 bytes  CRC32
337                              0x02   4 bytes  (Reserved)
338                              0x03   4 bytes  (Reserved)
339                              0x04   8 bytes  CRC64
340                              0x05   8 bytes  (Reserved)
341                              0x06   8 bytes  (Reserved)
342                              0x07  16 bytes  (Reserved)
343                              0x08  16 bytes  (Reserved)
344                              0x09  16 bytes  (Reserved)
345                              0x0A  32 bytes  SHA-256
346                              0x0B  32 bytes  (Reserved)
347                              0x0C  32 bytes  (Reserved)
348                              0x0D  64 bytes  (Reserved)
349                              0x0E  64 bytes  (Reserved)
350                              0x0F  64 bytes  (Reserved)
351             4-7    0xF0  Reserved for future use; MUST be zero for now.
352
353        Implementations SHOULD support at least the Check IDs 0x00
354        (None) and 0x01 (CRC32). Supporting other Check IDs is
355        OPTIONAL. If an unsupported Check is used, the decoder SHOULD
356        indicate a warning or error.
357
358        If any reserved bit is set, the decoder MUST indicate an error.
359        It is possible that there is a new field present which the
360        decoder is not aware of, and can thus parse the Stream Header
361        incorrectly.
362
363
3642.1.1.3. CRC32
365
366        The CRC32 is calculated from the Stream Flags field. It is
367        stored as an unsigned 32-bit little endian integer. If the
368        calculated value does not match the stored one, the decoder
369        MUST indicate an error.
370
371        The idea is that Stream Flags would always be two bytes, even
372        if new features are needed. This way old decoders will be able
373        to verify the CRC32 calculated from Stream Flags, and thus
374        distinguish between corrupt files (CRC32 doesn't match) and
375        files that the decoder doesn't support (CRC32 matches but
376        Stream Flags has reserved bits set).
377
378
3792.1.2. Stream Footer
380
381        +-+-+-+-+---+---+---+---+-------+------+----------+---------+
382        | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
383        +-+-+-+-+---+---+---+---+-------+------+----------+---------+
384
385
3862.1.2.1. CRC32
387
388        The CRC32 is calculated from the Backward Size and Stream Flags
389        fields. It is stored as an unsigned 32-bit little endian
390        integer. If the calculated value does not match the stored one,
391        the decoder MUST indicate an error.
392
393        The reason to have the CRC32 field before the Backward Size and
394        Stream Flags fields is to keep the four-byte fields aligned to
395        a multiple of four bytes.
396
397
3982.1.2.2. Backward Size
399
400        Backward Size is stored as a 32-bit little endian integer,
401        which indicates the size of the Index field as multiple of
402        four bytes, minimum value being four bytes:
403
404            real_backward_size = (stored_backward_size + 1) * 4;
405
406        If the stored value does not match the real size of the Index
407        field, the decoder MUST indicate an error.
408
409        Using a fixed-size integer to store Backward Size makes
410        it slightly simpler to parse the Stream Footer when the
411        application needs to parse the Stream backwards.
412
413
4142.1.2.3. Stream Flags
415
416        This is a copy of the Stream Flags field from the Stream
417        Header. The information stored to Stream Flags is needed
418        when parsing the Stream backwards. The decoder MUST compare
419        the Stream Flags fields in both Stream Header and Stream
420        Footer, and indicate an error if they are not identical.
421
422
4232.1.2.4. Footer Magic Bytes
424
425        As the last step of the decoding process, the decoder MUST
426        verify the existence of Footer Magic Bytes. If they don't
427        match, an error MUST be indicated.
428
429            Using a C array and ASCII:
430            const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
431
432            In hexadecimal:
433            59 5A
434
435        The primary reason to have Footer Magic Bytes is to make
436        it easier to detect incomplete files quickly, without
437        uncompressing. If the file does not end with Footer Magic Bytes
438        (excluding Stream Padding described in Section 2.2), it cannot
439        be undamaged, unless someone has intentionally appended garbage
440        after the end of the Stream.
441
442
4432.2. Stream Padding
444
445        Only the decoders that support decoding of concatenated Streams
446        MUST support Stream Padding.
447
448        Stream Padding MUST contain only null bytes. To preserve the
449        four-byte alignment of consecutive Streams, the size of Stream
450        Padding MUST be a multiple of four bytes. Empty Stream Padding
451        is allowed. If these requirements are not met, the decoder MUST
452        indicate an error.
453
454        Note that non-empty Stream Padding is allowed at the end of the
455        file; there doesn't need to be a new Stream after non-empty
456        Stream Padding. This can be convenient in certain situations
457        [GNU-tar].
458
459        The possibility of Stream Padding MUST be taken into account
460        when designing an application that parses Streams backwards,
461        and the application supports concatenated Streams.
462
463
4643. Block
465
466        +==============+=================+===============+=======+
467        | Block Header | Compressed Data | Block Padding | Check |
468        +==============+=================+===============+=======+
469
470
4713.1. Block Header
472
473        +-------------------+-------------+=================+
474        | Block Header Size | Block Flags | Compressed Size |
475        +-------------------+-------------+=================+
476
477             +===================+======================+
478        ---> | Uncompressed Size | List of Filter Flags |
479             +===================+======================+
480
481             +================+--+--+--+--+
482        ---> | Header Padding |   CRC32   |
483             +================+--+--+--+--+
484
485
4863.1.1. Block Header Size
487
488        This field overlaps with the Index Indicator field (see
489        Section 4.1).
490
491        This field contains the size of the Block Header field,
492        including the Block Header Size field itself. Valid values are
493        in the range [0x01, 0xFF], which indicate the size of the Block
494        Header as multiples of four bytes, minimum size being eight
495        bytes:
496
497            real_header_size = (encoded_header_size + 1) * 4;
498
499        If a Block Header bigger than 1024 bytes is needed in the
500        future, a new field can be added between the Block Header and
501        Compressed Data fields. The presence of this new field would
502        be indicated in the Block Header field.
503
504
5053.1.2. Block Flags
506
507        The Block Flags field is a bit field:
508
509            Bit(s)  Mask  Description
510             0-1    0x03  Number of filters (1-4)
511             2-5    0x3C  Reserved for future use; MUST be zero for now.
512              6     0x40  The Compressed Size field is present.
513              7     0x80  The Uncompressed Size field is present.
514
515        If any reserved bit is set, the decoder MUST indicate an error.
516        It is possible that there is a new field present which the
517        decoder is not aware of, and can thus parse the Block Header
518        incorrectly.
519
520
5213.1.3. Compressed Size
522
523        This field is present only if the appropriate bit is set in
524        the Block Flags field (see Section 3.1.2).
525
526        The Compressed Size field contains the size of the Compressed
527        Data field, which MUST be non-zero. Compressed Size is stored
528        using the encoding described in Section 1.2. If the Compressed
529        Size doesn't match the size of the Compressed Data field, the
530        decoder MUST indicate an error.
531
532
5333.1.4. Uncompressed Size
534
535        This field is present only if the appropriate bit is set in
536        the Block Flags field (see Section 3.1.2).
537
538        The Uncompressed Size field contains the size of the Block
539        after uncompressing. Uncompressed Size is stored using the
540        encoding described in Section 1.2. If the Uncompressed Size
541        does not match the real uncompressed size, the decoder MUST
542        indicate an error.
543
544        Storing the Compressed Size and Uncompressed Size fields serves
545        several purposes:
546          - The decoder knows how much memory it needs to allocate
547            for a temporary buffer in multithreaded mode.
548          - Simple error detection: wrong size indicates a broken file.
549          - Seeking forwards to a specific location in streamed mode.
550
551        It should be noted that the only reliable way to determine
552        the real uncompressed size is to uncompress the Block,
553        because the Block Header and Index fields may contain
554        (intentionally or unintentionally) invalid information.
555
556
5573.1.5. List of Filter Flags
558
559        +================+================+     +================+
560        | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
561        +================+================+     +================+
562
563        The number of Filter Flags fields is stored in the Block Flags
564        field (see Section 3.1.2).
565
566        The format of each Filter Flags field is as follows:
567
568            +===========+====================+===================+
569            | Filter ID | Size of Properties | Filter Properties |
570            +===========+====================+===================+
571
572        Both Filter ID and Size of Properties are stored using the
573        encoding described in Section 1.2. Size of Properties indicates
574        the size of the Filter Properties field as bytes. The list of
575        officially defined Filter IDs and the formats of their Filter
576        Properties are described in Section 5.3.
577
578        Filter IDs greater than or equal to 0x4000_0000_0000_0000
579        (2^62) are reserved for implementation-specific internal use.
580        These Filter IDs MUST never be used in List of Filter Flags.
581
582
5833.1.6. Header Padding
584
585        This field contains as many null byte as it is needed to make
586        the Block Header have the size specified in Block Header Size.
587        If any of the bytes are not null bytes, the decoder MUST
588        indicate an error. It is possible that there is a new field
589        present which the decoder is not aware of, and can thus parse
590        the Block Header incorrectly.
591
592
5933.1.7. CRC32
594
595        The CRC32 is calculated over everything in the Block Header
596        field except the CRC32 field itself. It is stored as an
597        unsigned 32-bit little endian integer. If the calculated
598        value does not match the stored one, the decoder MUST indicate
599        an error.
600
601        By verifying the CRC32 of the Block Header before parsing the
602        actual contents allows the decoder to distinguish between
603        corrupt and unsupported files.
604
605
6063.2. Compressed Data
607
608        The format of Compressed Data depends on Block Flags and List
609        of Filter Flags. Excluding the descriptions of the simplest
610        filters in Section 5.3, the format of the filter-specific
611        encoded data is out of scope of this document.
612
613
6143.3. Block Padding
615
616        Block Padding MUST contain 0-3 null bytes to make the size of
617        the Block a multiple of four bytes. This can be needed when
618        the size of Compressed Data is not a multiple of four. If any
619        of the bytes in Block Padding are not null bytes, the decoder
620        MUST indicate an error.
621
622
6233.4. Check
624
625        The type and size of the Check field depends on which bits
626        are set in the Stream Flags field (see Section 2.1.1.2).
627
628        The Check, when used, is calculated from the original
629        uncompressed data. If the calculated Check does not match the
630        stored one, the decoder MUST indicate an error. If the selected
631        type of Check is not supported by the decoder, it SHOULD
632        indicate a warning or error.
633
634
6354. Index
636
637        +-----------------+===================+
638        | Index Indicator | Number of Records |
639        +-----------------+===================+
640
641             +=================+===============+-+-+-+-+
642        ---> | List of Records | Index Padding | CRC32 |
643             +=================+===============+-+-+-+-+
644
645        Index serves several purposes. Using it, one can
646          - verify that all Blocks in a Stream have been processed;
647          - find out the uncompressed size of a Stream; and
648          - quickly access the beginning of any Block (random access).
649
650
6514.1. Index Indicator
652
653        This field overlaps with the Block Header Size field (see
654        Section 3.1.1). The value of Index Indicator is always 0x00.
655
656
6574.2. Number of Records
658
659        This field indicates how many Records there are in the List
660        of Records field, and thus how many Blocks there are in the
661        Stream. The value is stored using the encoding described in
662        Section 1.2. If the decoder has decoded all the Blocks of the
663        Stream, and then notices that the Number of Records doesn't
664        match the real number of Blocks, the decoder MUST indicate an
665        error.
666
667
6684.3. List of Records
669
670        List of Records consists of as many Records as indicated by the
671        Number of Records field:
672
673            +========+========+
674            | Record | Record | ...
675            +========+========+
676
677        Each Record contains information about one Block:
678
679            +===============+===================+
680            | Unpadded Size | Uncompressed Size |
681            +===============+===================+
682
683        If the decoder has decoded all the Blocks of the Stream, it
684        MUST verify that the contents of the Records match the real
685        Unpadded Size and Uncompressed Size of the respective Blocks.
686
687        Implementation hint: It is possible to verify the Index with
688        constant memory usage by calculating for example SHA-256 of
689        both the real size values and the List of Records, then
690        comparing the hash values. Implementing this using
691        non-cryptographic hash like CRC32 SHOULD be avoided unless
692        small code size is important.
693
694        If the decoder supports random-access reading, it MUST verify
695        that Unpadded Size and Uncompressed Size of every completely
696        decoded Block match the sizes stored in the Index. If only
697        partial Block is decoded, the decoder MUST verify that the
698        processed sizes don't exceed the sizes stored in the Index.
699
700
7014.3.1. Unpadded Size
702
703        This field indicates the size of the Block excluding the Block
704        Padding field. That is, Unpadded Size is the size of the Block
705        Header, Compressed Data, and Check fields. Unpadded Size is
706        stored using the encoding described in Section 1.2. The value
707        MUST never be zero; with the current structure of Blocks, the
708        actual minimum value for Unpadded Size is five.
709
710        Implementation note: Because the size of the Block Padding
711        field is not included in Unpadded Size, calculating the total
712        size of a Stream or doing random-access reading requires
713        calculating the actual size of the Blocks by rounding Unpadded
714        Sizes up to the next multiple of four.
715
716        The reason to exclude Block Padding from Unpadded Size is to
717        ease making a raw copy of Compressed Data without Block
718        Padding. This can be useful, for example, if someone wants
719        to convert Streams to some other file format quickly.
720
721
7224.3.2. Uncompressed Size
723
724        This field indicates the Uncompressed Size of the respective
725        Block as bytes. The value is stored using the encoding
726        described in Section 1.2.
727
728
7294.4. Index Padding
730
731        This field MUST contain 0-3 null bytes to pad the Index to
732        a multiple of four bytes. If any of the bytes are not null
733        bytes, the decoder MUST indicate an error.
734
735
7364.5. CRC32
737
738        The CRC32 is calculated over everything in the Index field
739        except the CRC32 field itself. The CRC32 is stored as an
740        unsigned 32-bit little endian integer. If the calculated
741        value does not match the stored one, the decoder MUST indicate
742        an error.
743
744
7455. Filter Chains
746
747        The Block Flags field defines how many filters are used. When
748        more than one filter is used, the filters are chained; that is,
749        the output of one filter is the input of another filter. The
750        following figure illustrates the direction of data flow.
751
752                    v   Uncompressed Data   ^
753                    |       Filter 0        |
754            Encoder |       Filter 1        | Decoder
755                    |       Filter n        |
756                    v    Compressed Data    ^
757
758
7595.1. Alignment
760
761        Alignment of uncompressed input data is usually the job of
762        the application producing the data. For example, to get the
763        best results, an archiver tool should make sure that all
764        PowerPC executable files in the archive stream start at
765        offsets that are multiples of four bytes.
766
767        Some filters, for example LZMA2, can be configured to take
768        advantage of specified alignment of input data. Note that
769        taking advantage of aligned input can be beneficial also when
770        a filter is not the first filter in the chain. For example,
771        if you compress PowerPC executables, you may want to use the
772        PowerPC filter and chain that with the LZMA2 filter. Because
773        not only the input but also the output alignment of the PowerPC
774        filter is four bytes, it is now beneficial to set LZMA2
775        settings so that the LZMA2 encoder can take advantage of its
776        four-byte-aligned input data.
777
778        The output of the last filter in the chain is stored to the
779        Compressed Data field, which is is guaranteed to be aligned
780        to a multiple of four bytes relative to the beginning of the
781        Stream. This can increase
782          - speed, if the filtered data is handled multiple bytes at
783            a time by the filter-specific encoder and decoder,
784            because accessing aligned data in computer memory is
785            usually faster; and
786          - compression ratio, if the output data is later compressed
787            with an external compression tool.
788
789
7905.2. Security
791
792        If filters would be allowed to be chained freely, it would be
793        possible to create malicious files, that would be very slow to
794        decode. Such files could be used to create denial of service
795        attacks.
796
797        Slow files could occur when multiple filters are chained:
798
799            v   Compressed input data
800            |   Filter 1 decoder (last filter)
801            |   Filter 0 decoder (non-last filter)
802            v   Uncompressed output data
803
804        The decoder of the last filter in the chain produces a lot of
805        output from little input. Another filter in the chain takes the
806        output of the last filter, and produces very little output
807        while consuming a lot of input. As a result, a lot of data is
808        moved inside the filter chain, but the filter chain as a whole
809        gets very little work done.
810
811        To prevent this kind of slow files, there are restrictions on
812        how the filters can be chained. These restrictions MUST be
813        taken into account when designing new filters.
814
815        The maximum number of filters in the chain has been limited to
816        four, thus there can be at maximum of three non-last filters.
817        Of these three non-last filters, only two are allowed to change
818        the size of the data.
819
820        The non-last filters, that change the size of the data, MUST
821        have a limit how much the decoder can compress the data: the
822        decoder SHOULD produce at least n bytes of output when the
823        filter is given 2n bytes of input. This  limit is not
824        absolute, but significant deviations MUST be avoided.
825
826        The above limitations guarantee that if the last filter in the
827        chain produces 4n bytes of output, the chain as a whole will
828        produce at least n bytes of output.
829
830
8315.3. Filters
832
8335.3.1. LZMA2
834
835        LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
836        compression algorithm with high compression ratio and fast
837        decompression. LZMA is based on LZ77 and range coding
838        algorithms.
839
840        LZMA2 is an extension on top of the original LZMA. LZMA2 uses
841        LZMA internally, but adds support for flushing the encoder,
842        uncompressed chunks, eases stateful decoder implementations,
843        and improves support for multithreading. Thus, the plain LZMA
844        will not be supported in this file format.
845
846            Filter ID:                  0x21
847            Size of Filter Properties:  1 byte
848            Changes size of data:       Yes
849            Allow as a non-last filter: No
850            Allow as the last filter:   Yes
851
852            Preferred alignment:
853                Input data:             Adjustable to 1/2/4/8/16 byte(s)
854                Output data:            1 byte
855
856        The format of the one-byte Filter Properties field is as
857        follows:
858
859            Bits   Mask   Description
860            0-5    0x3F   Dictionary Size
861            6-7    0xC0   Reserved for future use; MUST be zero for now.
862
863        Dictionary Size is encoded with one-bit mantissa and five-bit
864        exponent. The smallest dictionary size is 4 KiB and the biggest
865        is 4 GiB.
866
867            Raw value   Mantissa   Exponent   Dictionary size
868                0           2         11         4 KiB
869                1           3         11         6 KiB
870                2           2         12         8 KiB
871                3           3         12        12 KiB
872                4           2         13        16 KiB
873                5           3         13        24 KiB
874                6           2         14        32 KiB
875              ...         ...        ...      ...
876               35           3         27       768 MiB
877               36           2         28      1024 MiB
878               37           3         29      1536 MiB
879               38           2         30      2048 MiB
880               39           3         30      3072 MiB
881               40           2         31      4096 MiB - 1 B
882
883        Instead of having a table in the decoder, the dictionary size
884        can be decoded using the following C code:
885
886            const uint8_t bits = get_dictionary_flags() & 0x3F;
887            if (bits > 40)
888                return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
889
890            uint32_t dictionary_size;
891            if (bits == 40) {
892                dictionary_size = UINT32_MAX;
893            } else {
894                dictionary_size = 2 | (bits & 1);
895                dictionary_size <<= bits / 2 + 11;
896            }
897
898
8995.3.2. Branch/Call/Jump Filters for Executables
900
901        These filters convert relative branch, call, and jump
902        instructions to their absolute counterparts in executable
903        files. This conversion increases redundancy and thus
904        compression ratio.
905
906            Size of Filter Properties:  0 or 4 bytes
907            Changes size of data:       No
908            Allow as a non-last filter: Yes
909            Allow as the last filter:   No
910
911        Below is the list of filters in this category. The alignment
912        is the same for both input and output data.
913
914            Filter ID   Alignment   Description
915              0x04       1 byte     x86 filter (BCJ)
916              0x05       4 bytes    PowerPC (big endian) filter
917              0x06      16 bytes    IA64 filter
918              0x07       4 bytes    ARM (little endian) filter
919              0x08       2 bytes    ARM Thumb (little endian) filter
920              0x09       4 bytes    SPARC filter
921
922        If the size of Filter Properties is four bytes, the Filter
923        Properties field contains the start offset used for address
924        conversions. It is stored as an unsigned 32-bit little endian
925        integer. The start offset MUST be a multiple of the alignment
926        of the filter as listed in the table above; if it isn't, the
927        decoder MUST indicate an error. If the size of Filter
928        Properties is zero, the start offset is zero.
929
930        Setting the start offset may be useful if an executable has
931        multiple sections, and there are many cross-section calls.
932        Taking advantage of this feature usually requires usage of
933        the Subblock filter, whose design is not complete yet.
934
935
9365.3.3. Delta
937
938        The Delta filter may increase compression ratio when the value
939        of the next byte correlates with the value of an earlier byte
940        at specified distance.
941
942            Filter ID:                  0x03
943            Size of Filter Properties:  1 byte
944            Changes size of data:       No
945            Allow as a non-last filter: Yes
946            Allow as the last filter:   No
947
948            Preferred alignment:
949                Input data:             1 byte
950                Output data:            Same as the original input data
951
952        The Properties byte indicates the delta distance, which can be
953        1-256 bytes backwards from the current byte: 0x00 indicates
954        distance of 1 byte and 0xFF distance of 256 bytes.
955
956
9575.3.3.1. Format of the Encoded Output
958
959        The code below illustrates both encoding and decoding with
960        the Delta filter.
961
962            // Distance is in the range [1, 256].
963            const unsigned int distance = get_properties_byte() + 1;
964            uint8_t pos = 0;
965            uint8_t delta[256];
966
967            memset(delta, 0, sizeof(delta));
968
969            while (1) {
970                const int byte = read_byte();
971                if (byte == EOF)
972                    break;
973
974                uint8_t tmp = delta[(uint8_t)(distance + pos)];
975                if (is_encoder) {
976                    tmp = (uint8_t)(byte) - tmp;
977                    delta[pos] = (uint8_t)(byte);
978                } else {
979                    tmp = (uint8_t)(byte) + tmp;
980                    delta[pos] = tmp;
981                }
982
983                write_byte(tmp);
984                --pos;
985            }
986
987
9885.4. Custom Filter IDs
989
990        If a developer wants to use custom Filter IDs, he has two
991        choices. The first choice is to contact Lasse Collin and ask
992        him to allocate a range of IDs for the developer.
993
994        The second choice is to generate a 40-bit random integer,
995        which the developer can use as his personal Developer ID.
996        To minimize the risk of collisions, Developer ID has to be
997        a randomly generated integer, not manually selected "hex word".
998        The following command, which works on many free operating
999        systems, can be used to generate Developer ID:
1000
1001            dd if=/dev/urandom bs=5 count=1 | hexdump
1002
1003        The developer can then use his Developer ID to create unique
1004        (well, hopefully unique) Filter IDs.
1005
1006            Bits    Mask                    Description
1007             0-15   0x0000_0000_0000_FFFF   Filter ID
1008            16-55   0x00FF_FFFF_FFFF_0000   Developer ID
1009            56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
1010
1011        The resulting 63-bit integer will use 9 bytes of space when
1012        stored using the encoding described in Section 1.2. To get
1013        a shorter ID, see the beginning of this Section how to
1014        request a custom ID range.
1015
1016
10175.4.1. Reserved Custom Filter ID Ranges
1018
1019        Range                       Description
1020        0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
1021        0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
1022        0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
1023
1024
10256. Cyclic Redundancy Checks
1026
1027        There are several incompatible variations to calculate CRC32
1028        and CRC64. For simplicity and clarity, complete examples are
1029        provided to calculate the checks as they are used in this file
1030        format. Implementations MAY use different code as long as it
1031        gives identical results.
1032
1033        The program below reads data from standard input, calculates
1034        the CRC32 and CRC64 values, and prints the calculated values
1035        as big endian hexadecimal strings to standard output.
1036
1037            #include <stddef.h>
1038            #include <inttypes.h>
1039            #include <stdio.h>
1040
1041            uint32_t crc32_table[256];
1042            uint64_t crc64_table[256];
1043
1044            void
1045            init(void)
1046            {
1047                static const uint32_t poly32 = UINT32_C(0xEDB88320);
1048                static const uint64_t poly64
1049                        = UINT64_C(0xC96C5795D7870F42);
1050
1051                for (size_t i = 0; i < 256; ++i) {
1052                    uint32_t crc32 = i;
1053                    uint64_t crc64 = i;
1054
1055                    for (size_t j = 0; j < 8; ++j) {
1056                        if (crc32 & 1)
1057                            crc32 = (crc32 >> 1) ^ poly32;
1058                        else
1059                            crc32 >>= 1;
1060
1061                        if (crc64 & 1)
1062                            crc64 = (crc64 >> 1) ^ poly64;
1063                        else
1064                            crc64 >>= 1;
1065                    }
1066
1067                    crc32_table[i] = crc32;
1068                    crc64_table[i] = crc64;
1069                }
1070            }
1071
1072            uint32_t
1073            crc32(const uint8_t *buf, size_t size, uint32_t crc)
1074            {
1075                crc = ~crc;
1076                for (size_t i = 0; i < size; ++i)
1077                    crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1078                            ^ (crc >> 8);
1079                return ~crc;
1080            }
1081
1082            uint64_t
1083            crc64(const uint8_t *buf, size_t size, uint64_t crc)
1084            {
1085                crc = ~crc;
1086                for (size_t i = 0; i < size; ++i)
1087                    crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1088                            ^ (crc >> 8);
1089                return ~crc;
1090            }
1091
1092            int
1093            main()
1094            {
1095                init();
1096
1097                uint32_t value32 = 0;
1098                uint64_t value64 = 0;
1099                uint64_t total_size = 0;
1100                uint8_t buf[8192];
1101
1102                while (1) {
1103                    const size_t buf_size
1104                            = fread(buf, 1, sizeof(buf), stdin);
1105                    if (buf_size == 0)
1106                        break;
1107
1108                    total_size += buf_size;
1109                    value32 = crc32(buf, buf_size, value32);
1110                    value64 = crc64(buf, buf_size, value64);
1111                }
1112
1113                printf("Bytes:  %" PRIu64 "\n", total_size);
1114                printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1115                printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1116
1117                return 0;
1118            }
1119
1120
11217. References
1122
1123        LZMA SDK - The original LZMA implementation
1124        http://7-zip.org/sdk.html
1125
1126        LZMA Utils - LZMA adapted to POSIX-like systems
1127        http://tukaani.org/lzma/
1128
1129        XZ Utils - The next generation of LZMA Utils
1130        http://tukaani.org/xz/
1131
1132        [RFC-1952]
1133        GZIP file format specification version 4.3
1134        http://www.ietf.org/rfc/rfc1952.txt
1135          - Notation of byte boxes in section "2.1. Overall conventions"
1136
1137        [RFC-2119]
1138        Key words for use in RFCs to Indicate Requirement Levels
1139        http://www.ietf.org/rfc/rfc2119.txt
1140
1141        [GNU-tar]
1142        GNU tar 1.21 manual
1143        http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1144          - Node 9.4.2 "Blocking Factor", paragraph that begins
1145            "gzip will complain about trailing garbage"
1146          - Note that this URL points to the latest version of the
1147            manual, and may some day not contain the note which is in
1148            1.21. For the exact version of the manual, download GNU
1149            tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz
1150
1151