1 2The .xz File Format 3=================== 4 5Version 1.0.4 (2009-08-27) 6 7 8 0. Preface 9 0.1. Notices and Acknowledgements 10 0.2. Getting the Latest Version 11 0.3. Version History 12 1. Conventions 13 1.1. Byte and Its Representation 14 1.2. Multibyte Integers 15 2. Overall Structure of .xz File 16 2.1. Stream 17 2.1.1. Stream Header 18 2.1.1.1. Header Magic Bytes 19 2.1.1.2. Stream Flags 20 2.1.1.3. CRC32 21 2.1.2. Stream Footer 22 2.1.2.1. CRC32 23 2.1.2.2. Backward Size 24 2.1.2.3. Stream Flags 25 2.1.2.4. Footer Magic Bytes 26 2.2. Stream Padding 27 3. Block 28 3.1. Block Header 29 3.1.1. Block Header Size 30 3.1.2. Block Flags 31 3.1.3. Compressed Size 32 3.1.4. Uncompressed Size 33 3.1.5. List of Filter Flags 34 3.1.6. Header Padding 35 3.1.7. CRC32 36 3.2. Compressed Data 37 3.3. Block Padding 38 3.4. Check 39 4. Index 40 4.1. Index Indicator 41 4.2. Number of Records 42 4.3. List of Records 43 4.3.1. Unpadded Size 44 4.3.2. Uncompressed Size 45 4.4. Index Padding 46 4.5. CRC32 47 5. Filter Chains 48 5.1. Alignment 49 5.2. Security 50 5.3. Filters 51 5.3.1. LZMA2 52 5.3.2. Branch/Call/Jump Filters for Executables 53 5.3.3. Delta 54 5.3.3.1. Format of the Encoded Output 55 5.4. Custom Filter IDs 56 5.4.1. Reserved Custom Filter ID Ranges 57 6. Cyclic Redundancy Checks 58 7. References 59 60 610. Preface 62 63 This document describes the .xz file format (filename suffix 64 ".xz", MIME type "application/x-xz"). It is intended that this 65 this format replace the old .lzma format used by LZMA SDK and 66 LZMA Utils. 67 68 690.1. Notices and Acknowledgements 70 71 This file format was designed by Lasse Collin 72 <lasse.collin@tukaani.org> and Igor Pavlov. 73 74 Special thanks for helping with this document goes to 75 Ville Koskinen. Thanks for helping with this document goes to 76 Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius. 77 78 This document has been put into the public domain. 79 80 810.2. Getting the Latest Version 82 83 The latest official version of this document can be downloaded 84 from <http://tukaani.org/xz/xz-file-format.txt>. 85 86 Specific versions of this document have a filename 87 xz-file-format-X.Y.Z.txt where X.Y.Z is the version number. 88 For example, the version 1.0.0 of this document is available 89 at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>. 90 91 920.3. Version History 93 94 Version Date Description 95 96 1.0.4 2009-08-27 Language improvements in Sections 1.2, 97 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1 98 99 1.0.3 2009-06-05 Spelling fixes in Sections 5.1 and 5.4 100 101 1.0.2 2009-06-04 Typo fixes in Sections 4 and 5.3.1 102 103 1.0.1 2009-06-01 Typo fix in Section 0.3 and minor 104 clarifications to Sections 2, 2.2, 105 3.3, 4.4, and 5.3.2 106 107 1.0.0 2009-01-14 The first official version 108 109 1101. Conventions 111 112 The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", 113 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 114 document are to be interpreted as described in [RFC-2119]. 115 116 Indicating a warning means displaying a message, returning 117 appropriate exit status, or doing something else to let the 118 user know that something worth warning occurred. The operation 119 SHOULD still finish if a warning is indicated. 120 121 Indicating an error means displaying a message, returning 122 appropriate exit status, or doing something else to let the 123 user know that something prevented successfully finishing the 124 operation. The operation MUST be aborted once an error has 125 been indicated. 126 127 1281.1. Byte and Its Representation 129 130 In this document, byte is always 8 bits. 131 132 A "null byte" has all bits unset. That is, the value of a null 133 byte is 0x00. 134 135 To represent byte blocks, this document uses notation that 136 is similar to the notation used in [RFC-1952]: 137 138 +-------+ 139 | Foo | One byte. 140 +-------+ 141 142 +---+---+ 143 | Foo | Two bytes; that is, some of the vertical bars 144 +---+---+ can be missing. 145 146 +=======+ 147 | Foo | Zero or more bytes. 148 +=======+ 149 150 In this document, a boxed byte or a byte sequence declared 151 using this notation is called "a field". The example field 152 above would be called "the Foo field" or plain "Foo". 153 154 If there are many fields, they may be split to multiple lines. 155 This is indicated with an arrow ("--->"): 156 157 +=====+ 158 | Foo | 159 +=====+ 160 161 +=====+ 162 ---> | Bar | 163 +=====+ 164 165 The above is equivalent to this: 166 167 +=====+=====+ 168 | Foo | Bar | 169 +=====+=====+ 170 171 1721.2. Multibyte Integers 173 174 Multibyte integers of static length, such as CRC values, 175 are stored in little endian byte order (least significant 176 byte first). 177 178 When smaller values are more likely than bigger values (for 179 example file sizes), multibyte integers are encoded in a 180 variable-length representation: 181 - Numbers in the range [0, 127] are copied as is, and take 182 one byte of space. 183 - Bigger numbers will occupy two or more bytes. All but the 184 last byte of the multibyte representation have the highest 185 (eighth) bit set. 186 187 For now, the value of the variable-length integers is limited 188 to 63 bits, which limits the encoded size of the integer to 189 nine bytes. These limits may be increased in the future if 190 needed. 191 192 The following C code illustrates encoding and decoding of 193 variable-length integers. The functions return the number of 194 bytes occupied by the integer (1-9), or zero on error. 195 196 #include <stddef.h> 197 #include <inttypes.h> 198 199 size_t 200 encode(uint8_t buf[static 9], uint64_t num) 201 { 202 if (num > UINT64_MAX / 2) 203 return 0; 204 205 size_t i = 0; 206 207 while (num >= 0x80) { 208 buf[i++] = (uint8_t)(num) | 0x80; 209 num >>= 7; 210 } 211 212 buf[i++] = (uint8_t)(num); 213 214 return i; 215 } 216 217 size_t 218 decode(const uint8_t buf[], size_t size_max, uint64_t *num) 219 { 220 if (size_max == 0) 221 return 0; 222 223 if (size_max > 9) 224 size_max = 9; 225 226 *num = buf[0] & 0x7F; 227 size_t i = 0; 228 229 while (buf[i++] & 0x80) { 230 if (i >= size_max || buf[i] == 0x00) 231 return 0; 232 233 *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7); 234 } 235 236 return i; 237 } 238 239 2402. Overall Structure of .xz File 241 242 A standalone .xz files consist of one or more Streams which may 243 have Stream Padding between or after them: 244 245 +========+================+========+================+ 246 | Stream | Stream Padding | Stream | Stream Padding | ... 247 +========+================+========+================+ 248 249 The sizes of Stream and Stream Padding are always multiples 250 of four bytes, thus the size of every valid .xz file MUST be 251 a multiple of four bytes. 252 253 While a typical file contains only one Stream and no Stream 254 Padding, a decoder handling standalone .xz files SHOULD support 255 files that have more than one Stream or Stream Padding. 256 257 In contrast to standalone .xz files, when the .xz file format 258 is used as an internal part of some other file format or 259 communication protocol, it usually is expected that the decoder 260 stops after the first Stream, and doesn't look for Stream 261 Padding or possibly other Streams. 262 263 2642.1. Stream 265 266 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+ 267 | Stream Header | Block | Block | ... | Block | 268 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+ 269 270 +=======+-+-+-+-+-+-+-+-+-+-+-+-+ 271 ---> | Index | Stream Footer | 272 +=======+-+-+-+-+-+-+-+-+-+-+-+-+ 273 274 All the above fields have a size that is a multiple of four. If 275 Stream is used as an internal part of another file format, it 276 is RECOMMENDED to make the Stream start at an offset that is 277 a multiple of four bytes. 278 279 Stream Header, Index, and Stream Footer are always present in 280 a Stream. The maximum size of the Index field is 16 GiB (2^34). 281 282 There are zero or more Blocks. The maximum number of Blocks is 283 limited only by the maximum size of the Index field. 284 285 Total size of a Stream MUST be less than 8 EiB (2^63 bytes). 286 The same limit applies to the total amount of uncompressed 287 data stored in a Stream. 288 289 If an implementation supports handling .xz files with multiple 290 concatenated Streams, it MAY apply the above limits to the file 291 as a whole instead of limiting per Stream basis. 292 293 2942.1.1. Stream Header 295 296 +---+---+---+---+---+---+-------+------+--+--+--+--+ 297 | Header Magic Bytes | Stream Flags | CRC32 | 298 +---+---+---+---+---+---+-------+------+--+--+--+--+ 299 300 3012.1.1.1. Header Magic Bytes 302 303 The first six (6) bytes of the Stream are so called Header 304 Magic Bytes. They can be used to identify the file type. 305 306 Using a C array and ASCII: 307 const uint8_t HEADER_MAGIC[6] 308 = { 0xFD, '7', 'z', 'X', 'Z', 0x00 }; 309 310 In plain hexadecimal: 311 FD 37 7A 58 5A 00 312 313 Notes: 314 - The first byte (0xFD) was chosen so that the files cannot 315 be erroneously detected as being in .lzma format, in which 316 the first byte is in the range [0x00, 0xE0]. 317 - The sixth byte (0x00) was chosen to prevent applications 318 from misdetecting the file as a text file. 319 320 If the Header Magic Bytes don't match, the decoder MUST 321 indicate an error. 322 323 3242.1.1.2. Stream Flags 325 326 The first byte of Stream Flags is always a null byte. In the 327 future, this byte may be used to indicate a new Stream version 328 or other Stream properties. 329 330 The second byte of Stream Flags is a bit field: 331 332 Bit(s) Mask Description 333 0-3 0x0F Type of Check (see Section 3.4): 334 ID Size Check name 335 0x00 0 bytes None 336 0x01 4 bytes CRC32 337 0x02 4 bytes (Reserved) 338 0x03 4 bytes (Reserved) 339 0x04 8 bytes CRC64 340 0x05 8 bytes (Reserved) 341 0x06 8 bytes (Reserved) 342 0x07 16 bytes (Reserved) 343 0x08 16 bytes (Reserved) 344 0x09 16 bytes (Reserved) 345 0x0A 32 bytes SHA-256 346 0x0B 32 bytes (Reserved) 347 0x0C 32 bytes (Reserved) 348 0x0D 64 bytes (Reserved) 349 0x0E 64 bytes (Reserved) 350 0x0F 64 bytes (Reserved) 351 4-7 0xF0 Reserved for future use; MUST be zero for now. 352 353 Implementations SHOULD support at least the Check IDs 0x00 354 (None) and 0x01 (CRC32). Supporting other Check IDs is 355 OPTIONAL. If an unsupported Check is used, the decoder SHOULD 356 indicate a warning or error. 357 358 If any reserved bit is set, the decoder MUST indicate an error. 359 It is possible that there is a new field present which the 360 decoder is not aware of, and can thus parse the Stream Header 361 incorrectly. 362 363 3642.1.1.3. CRC32 365 366 The CRC32 is calculated from the Stream Flags field. It is 367 stored as an unsigned 32-bit little endian integer. If the 368 calculated value does not match the stored one, the decoder 369 MUST indicate an error. 370 371 The idea is that Stream Flags would always be two bytes, even 372 if new features are needed. This way old decoders will be able 373 to verify the CRC32 calculated from Stream Flags, and thus 374 distinguish between corrupt files (CRC32 doesn't match) and 375 files that the decoder doesn't support (CRC32 matches but 376 Stream Flags has reserved bits set). 377 378 3792.1.2. Stream Footer 380 381 +-+-+-+-+---+---+---+---+-------+------+----------+---------+ 382 | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes | 383 +-+-+-+-+---+---+---+---+-------+------+----------+---------+ 384 385 3862.1.2.1. CRC32 387 388 The CRC32 is calculated from the Backward Size and Stream Flags 389 fields. It is stored as an unsigned 32-bit little endian 390 integer. If the calculated value does not match the stored one, 391 the decoder MUST indicate an error. 392 393 The reason to have the CRC32 field before the Backward Size and 394 Stream Flags fields is to keep the four-byte fields aligned to 395 a multiple of four bytes. 396 397 3982.1.2.2. Backward Size 399 400 Backward Size is stored as a 32-bit little endian integer, 401 which indicates the size of the Index field as multiple of 402 four bytes, minimum value being four bytes: 403 404 real_backward_size = (stored_backward_size + 1) * 4; 405 406 If the stored value does not match the real size of the Index 407 field, the decoder MUST indicate an error. 408 409 Using a fixed-size integer to store Backward Size makes 410 it slightly simpler to parse the Stream Footer when the 411 application needs to parse the Stream backwards. 412 413 4142.1.2.3. Stream Flags 415 416 This is a copy of the Stream Flags field from the Stream 417 Header. The information stored to Stream Flags is needed 418 when parsing the Stream backwards. The decoder MUST compare 419 the Stream Flags fields in both Stream Header and Stream 420 Footer, and indicate an error if they are not identical. 421 422 4232.1.2.4. Footer Magic Bytes 424 425 As the last step of the decoding process, the decoder MUST 426 verify the existence of Footer Magic Bytes. If they don't 427 match, an error MUST be indicated. 428 429 Using a C array and ASCII: 430 const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' }; 431 432 In hexadecimal: 433 59 5A 434 435 The primary reason to have Footer Magic Bytes is to make 436 it easier to detect incomplete files quickly, without 437 uncompressing. If the file does not end with Footer Magic Bytes 438 (excluding Stream Padding described in Section 2.2), it cannot 439 be undamaged, unless someone has intentionally appended garbage 440 after the end of the Stream. 441 442 4432.2. Stream Padding 444 445 Only the decoders that support decoding of concatenated Streams 446 MUST support Stream Padding. 447 448 Stream Padding MUST contain only null bytes. To preserve the 449 four-byte alignment of consecutive Streams, the size of Stream 450 Padding MUST be a multiple of four bytes. Empty Stream Padding 451 is allowed. If these requirements are not met, the decoder MUST 452 indicate an error. 453 454 Note that non-empty Stream Padding is allowed at the end of the 455 file; there doesn't need to be a new Stream after non-empty 456 Stream Padding. This can be convenient in certain situations 457 [GNU-tar]. 458 459 The possibility of Stream Padding MUST be taken into account 460 when designing an application that parses Streams backwards, 461 and the application supports concatenated Streams. 462 463 4643. Block 465 466 +==============+=================+===============+=======+ 467 | Block Header | Compressed Data | Block Padding | Check | 468 +==============+=================+===============+=======+ 469 470 4713.1. Block Header 472 473 +-------------------+-------------+=================+ 474 | Block Header Size | Block Flags | Compressed Size | 475 +-------------------+-------------+=================+ 476 477 +===================+======================+ 478 ---> | Uncompressed Size | List of Filter Flags | 479 +===================+======================+ 480 481 +================+--+--+--+--+ 482 ---> | Header Padding | CRC32 | 483 +================+--+--+--+--+ 484 485 4863.1.1. Block Header Size 487 488 This field overlaps with the Index Indicator field (see 489 Section 4.1). 490 491 This field contains the size of the Block Header field, 492 including the Block Header Size field itself. Valid values are 493 in the range [0x01, 0xFF], which indicate the size of the Block 494 Header as multiples of four bytes, minimum size being eight 495 bytes: 496 497 real_header_size = (encoded_header_size + 1) * 4; 498 499 If a Block Header bigger than 1024 bytes is needed in the 500 future, a new field can be added between the Block Header and 501 Compressed Data fields. The presence of this new field would 502 be indicated in the Block Header field. 503 504 5053.1.2. Block Flags 506 507 The Block Flags field is a bit field: 508 509 Bit(s) Mask Description 510 0-1 0x03 Number of filters (1-4) 511 2-5 0x3C Reserved for future use; MUST be zero for now. 512 6 0x40 The Compressed Size field is present. 513 7 0x80 The Uncompressed Size field is present. 514 515 If any reserved bit is set, the decoder MUST indicate an error. 516 It is possible that there is a new field present which the 517 decoder is not aware of, and can thus parse the Block Header 518 incorrectly. 519 520 5213.1.3. Compressed Size 522 523 This field is present only if the appropriate bit is set in 524 the Block Flags field (see Section 3.1.2). 525 526 The Compressed Size field contains the size of the Compressed 527 Data field, which MUST be non-zero. Compressed Size is stored 528 using the encoding described in Section 1.2. If the Compressed 529 Size doesn't match the size of the Compressed Data field, the 530 decoder MUST indicate an error. 531 532 5333.1.4. Uncompressed Size 534 535 This field is present only if the appropriate bit is set in 536 the Block Flags field (see Section 3.1.2). 537 538 The Uncompressed Size field contains the size of the Block 539 after uncompressing. Uncompressed Size is stored using the 540 encoding described in Section 1.2. If the Uncompressed Size 541 does not match the real uncompressed size, the decoder MUST 542 indicate an error. 543 544 Storing the Compressed Size and Uncompressed Size fields serves 545 several purposes: 546 - The decoder knows how much memory it needs to allocate 547 for a temporary buffer in multithreaded mode. 548 - Simple error detection: wrong size indicates a broken file. 549 - Seeking forwards to a specific location in streamed mode. 550 551 It should be noted that the only reliable way to determine 552 the real uncompressed size is to uncompress the Block, 553 because the Block Header and Index fields may contain 554 (intentionally or unintentionally) invalid information. 555 556 5573.1.5. List of Filter Flags 558 559 +================+================+ +================+ 560 | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags | 561 +================+================+ +================+ 562 563 The number of Filter Flags fields is stored in the Block Flags 564 field (see Section 3.1.2). 565 566 The format of each Filter Flags field is as follows: 567 568 +===========+====================+===================+ 569 | Filter ID | Size of Properties | Filter Properties | 570 +===========+====================+===================+ 571 572 Both Filter ID and Size of Properties are stored using the 573 encoding described in Section 1.2. Size of Properties indicates 574 the size of the Filter Properties field as bytes. The list of 575 officially defined Filter IDs and the formats of their Filter 576 Properties are described in Section 5.3. 577 578 Filter IDs greater than or equal to 0x4000_0000_0000_0000 579 (2^62) are reserved for implementation-specific internal use. 580 These Filter IDs MUST never be used in List of Filter Flags. 581 582 5833.1.6. Header Padding 584 585 This field contains as many null byte as it is needed to make 586 the Block Header have the size specified in Block Header Size. 587 If any of the bytes are not null bytes, the decoder MUST 588 indicate an error. It is possible that there is a new field 589 present which the decoder is not aware of, and can thus parse 590 the Block Header incorrectly. 591 592 5933.1.7. CRC32 594 595 The CRC32 is calculated over everything in the Block Header 596 field except the CRC32 field itself. It is stored as an 597 unsigned 32-bit little endian integer. If the calculated 598 value does not match the stored one, the decoder MUST indicate 599 an error. 600 601 By verifying the CRC32 of the Block Header before parsing the 602 actual contents allows the decoder to distinguish between 603 corrupt and unsupported files. 604 605 6063.2. Compressed Data 607 608 The format of Compressed Data depends on Block Flags and List 609 of Filter Flags. Excluding the descriptions of the simplest 610 filters in Section 5.3, the format of the filter-specific 611 encoded data is out of scope of this document. 612 613 6143.3. Block Padding 615 616 Block Padding MUST contain 0-3 null bytes to make the size of 617 the Block a multiple of four bytes. This can be needed when 618 the size of Compressed Data is not a multiple of four. If any 619 of the bytes in Block Padding are not null bytes, the decoder 620 MUST indicate an error. 621 622 6233.4. Check 624 625 The type and size of the Check field depends on which bits 626 are set in the Stream Flags field (see Section 2.1.1.2). 627 628 The Check, when used, is calculated from the original 629 uncompressed data. If the calculated Check does not match the 630 stored one, the decoder MUST indicate an error. If the selected 631 type of Check is not supported by the decoder, it SHOULD 632 indicate a warning or error. 633 634 6354. Index 636 637 +-----------------+===================+ 638 | Index Indicator | Number of Records | 639 +-----------------+===================+ 640 641 +=================+===============+-+-+-+-+ 642 ---> | List of Records | Index Padding | CRC32 | 643 +=================+===============+-+-+-+-+ 644 645 Index serves several purposes. Using it, one can 646 - verify that all Blocks in a Stream have been processed; 647 - find out the uncompressed size of a Stream; and 648 - quickly access the beginning of any Block (random access). 649 650 6514.1. Index Indicator 652 653 This field overlaps with the Block Header Size field (see 654 Section 3.1.1). The value of Index Indicator is always 0x00. 655 656 6574.2. Number of Records 658 659 This field indicates how many Records there are in the List 660 of Records field, and thus how many Blocks there are in the 661 Stream. The value is stored using the encoding described in 662 Section 1.2. If the decoder has decoded all the Blocks of the 663 Stream, and then notices that the Number of Records doesn't 664 match the real number of Blocks, the decoder MUST indicate an 665 error. 666 667 6684.3. List of Records 669 670 List of Records consists of as many Records as indicated by the 671 Number of Records field: 672 673 +========+========+ 674 | Record | Record | ... 675 +========+========+ 676 677 Each Record contains information about one Block: 678 679 +===============+===================+ 680 | Unpadded Size | Uncompressed Size | 681 +===============+===================+ 682 683 If the decoder has decoded all the Blocks of the Stream, it 684 MUST verify that the contents of the Records match the real 685 Unpadded Size and Uncompressed Size of the respective Blocks. 686 687 Implementation hint: It is possible to verify the Index with 688 constant memory usage by calculating for example SHA-256 of 689 both the real size values and the List of Records, then 690 comparing the hash values. Implementing this using 691 non-cryptographic hash like CRC32 SHOULD be avoided unless 692 small code size is important. 693 694 If the decoder supports random-access reading, it MUST verify 695 that Unpadded Size and Uncompressed Size of every completely 696 decoded Block match the sizes stored in the Index. If only 697 partial Block is decoded, the decoder MUST verify that the 698 processed sizes don't exceed the sizes stored in the Index. 699 700 7014.3.1. Unpadded Size 702 703 This field indicates the size of the Block excluding the Block 704 Padding field. That is, Unpadded Size is the size of the Block 705 Header, Compressed Data, and Check fields. Unpadded Size is 706 stored using the encoding described in Section 1.2. The value 707 MUST never be zero; with the current structure of Blocks, the 708 actual minimum value for Unpadded Size is five. 709 710 Implementation note: Because the size of the Block Padding 711 field is not included in Unpadded Size, calculating the total 712 size of a Stream or doing random-access reading requires 713 calculating the actual size of the Blocks by rounding Unpadded 714 Sizes up to the next multiple of four. 715 716 The reason to exclude Block Padding from Unpadded Size is to 717 ease making a raw copy of Compressed Data without Block 718 Padding. This can be useful, for example, if someone wants 719 to convert Streams to some other file format quickly. 720 721 7224.3.2. Uncompressed Size 723 724 This field indicates the Uncompressed Size of the respective 725 Block as bytes. The value is stored using the encoding 726 described in Section 1.2. 727 728 7294.4. Index Padding 730 731 This field MUST contain 0-3 null bytes to pad the Index to 732 a multiple of four bytes. If any of the bytes are not null 733 bytes, the decoder MUST indicate an error. 734 735 7364.5. CRC32 737 738 The CRC32 is calculated over everything in the Index field 739 except the CRC32 field itself. The CRC32 is stored as an 740 unsigned 32-bit little endian integer. If the calculated 741 value does not match the stored one, the decoder MUST indicate 742 an error. 743 744 7455. Filter Chains 746 747 The Block Flags field defines how many filters are used. When 748 more than one filter is used, the filters are chained; that is, 749 the output of one filter is the input of another filter. The 750 following figure illustrates the direction of data flow. 751 752 v Uncompressed Data ^ 753 | Filter 0 | 754 Encoder | Filter 1 | Decoder 755 | Filter n | 756 v Compressed Data ^ 757 758 7595.1. Alignment 760 761 Alignment of uncompressed input data is usually the job of 762 the application producing the data. For example, to get the 763 best results, an archiver tool should make sure that all 764 PowerPC executable files in the archive stream start at 765 offsets that are multiples of four bytes. 766 767 Some filters, for example LZMA2, can be configured to take 768 advantage of specified alignment of input data. Note that 769 taking advantage of aligned input can be beneficial also when 770 a filter is not the first filter in the chain. For example, 771 if you compress PowerPC executables, you may want to use the 772 PowerPC filter and chain that with the LZMA2 filter. Because 773 not only the input but also the output alignment of the PowerPC 774 filter is four bytes, it is now beneficial to set LZMA2 775 settings so that the LZMA2 encoder can take advantage of its 776 four-byte-aligned input data. 777 778 The output of the last filter in the chain is stored to the 779 Compressed Data field, which is is guaranteed to be aligned 780 to a multiple of four bytes relative to the beginning of the 781 Stream. This can increase 782 - speed, if the filtered data is handled multiple bytes at 783 a time by the filter-specific encoder and decoder, 784 because accessing aligned data in computer memory is 785 usually faster; and 786 - compression ratio, if the output data is later compressed 787 with an external compression tool. 788 789 7905.2. Security 791 792 If filters would be allowed to be chained freely, it would be 793 possible to create malicious files, that would be very slow to 794 decode. Such files could be used to create denial of service 795 attacks. 796 797 Slow files could occur when multiple filters are chained: 798 799 v Compressed input data 800 | Filter 1 decoder (last filter) 801 | Filter 0 decoder (non-last filter) 802 v Uncompressed output data 803 804 The decoder of the last filter in the chain produces a lot of 805 output from little input. Another filter in the chain takes the 806 output of the last filter, and produces very little output 807 while consuming a lot of input. As a result, a lot of data is 808 moved inside the filter chain, but the filter chain as a whole 809 gets very little work done. 810 811 To prevent this kind of slow files, there are restrictions on 812 how the filters can be chained. These restrictions MUST be 813 taken into account when designing new filters. 814 815 The maximum number of filters in the chain has been limited to 816 four, thus there can be at maximum of three non-last filters. 817 Of these three non-last filters, only two are allowed to change 818 the size of the data. 819 820 The non-last filters, that change the size of the data, MUST 821 have a limit how much the decoder can compress the data: the 822 decoder SHOULD produce at least n bytes of output when the 823 filter is given 2n bytes of input. This limit is not 824 absolute, but significant deviations MUST be avoided. 825 826 The above limitations guarantee that if the last filter in the 827 chain produces 4n bytes of output, the chain as a whole will 828 produce at least n bytes of output. 829 830 8315.3. Filters 832 8335.3.1. LZMA2 834 835 LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose 836 compression algorithm with high compression ratio and fast 837 decompression. LZMA is based on LZ77 and range coding 838 algorithms. 839 840 LZMA2 is an extension on top of the original LZMA. LZMA2 uses 841 LZMA internally, but adds support for flushing the encoder, 842 uncompressed chunks, eases stateful decoder implementations, 843 and improves support for multithreading. Thus, the plain LZMA 844 will not be supported in this file format. 845 846 Filter ID: 0x21 847 Size of Filter Properties: 1 byte 848 Changes size of data: Yes 849 Allow as a non-last filter: No 850 Allow as the last filter: Yes 851 852 Preferred alignment: 853 Input data: Adjustable to 1/2/4/8/16 byte(s) 854 Output data: 1 byte 855 856 The format of the one-byte Filter Properties field is as 857 follows: 858 859 Bits Mask Description 860 0-5 0x3F Dictionary Size 861 6-7 0xC0 Reserved for future use; MUST be zero for now. 862 863 Dictionary Size is encoded with one-bit mantissa and five-bit 864 exponent. The smallest dictionary size is 4 KiB and the biggest 865 is 4 GiB. 866 867 Raw value Mantissa Exponent Dictionary size 868 0 2 11 4 KiB 869 1 3 11 6 KiB 870 2 2 12 8 KiB 871 3 3 12 12 KiB 872 4 2 13 16 KiB 873 5 3 13 24 KiB 874 6 2 14 32 KiB 875 ... ... ... ... 876 35 3 27 768 MiB 877 36 2 28 1024 MiB 878 37 3 29 1536 MiB 879 38 2 30 2048 MiB 880 39 3 30 3072 MiB 881 40 2 31 4096 MiB - 1 B 882 883 Instead of having a table in the decoder, the dictionary size 884 can be decoded using the following C code: 885 886 const uint8_t bits = get_dictionary_flags() & 0x3F; 887 if (bits > 40) 888 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB 889 890 uint32_t dictionary_size; 891 if (bits == 40) { 892 dictionary_size = UINT32_MAX; 893 } else { 894 dictionary_size = 2 | (bits & 1); 895 dictionary_size <<= bits / 2 + 11; 896 } 897 898 8995.3.2. Branch/Call/Jump Filters for Executables 900 901 These filters convert relative branch, call, and jump 902 instructions to their absolute counterparts in executable 903 files. This conversion increases redundancy and thus 904 compression ratio. 905 906 Size of Filter Properties: 0 or 4 bytes 907 Changes size of data: No 908 Allow as a non-last filter: Yes 909 Allow as the last filter: No 910 911 Below is the list of filters in this category. The alignment 912 is the same for both input and output data. 913 914 Filter ID Alignment Description 915 0x04 1 byte x86 filter (BCJ) 916 0x05 4 bytes PowerPC (big endian) filter 917 0x06 16 bytes IA64 filter 918 0x07 4 bytes ARM (little endian) filter 919 0x08 2 bytes ARM Thumb (little endian) filter 920 0x09 4 bytes SPARC filter 921 922 If the size of Filter Properties is four bytes, the Filter 923 Properties field contains the start offset used for address 924 conversions. It is stored as an unsigned 32-bit little endian 925 integer. The start offset MUST be a multiple of the alignment 926 of the filter as listed in the table above; if it isn't, the 927 decoder MUST indicate an error. If the size of Filter 928 Properties is zero, the start offset is zero. 929 930 Setting the start offset may be useful if an executable has 931 multiple sections, and there are many cross-section calls. 932 Taking advantage of this feature usually requires usage of 933 the Subblock filter, whose design is not complete yet. 934 935 9365.3.3. Delta 937 938 The Delta filter may increase compression ratio when the value 939 of the next byte correlates with the value of an earlier byte 940 at specified distance. 941 942 Filter ID: 0x03 943 Size of Filter Properties: 1 byte 944 Changes size of data: No 945 Allow as a non-last filter: Yes 946 Allow as the last filter: No 947 948 Preferred alignment: 949 Input data: 1 byte 950 Output data: Same as the original input data 951 952 The Properties byte indicates the delta distance, which can be 953 1-256 bytes backwards from the current byte: 0x00 indicates 954 distance of 1 byte and 0xFF distance of 256 bytes. 955 956 9575.3.3.1. Format of the Encoded Output 958 959 The code below illustrates both encoding and decoding with 960 the Delta filter. 961 962 // Distance is in the range [1, 256]. 963 const unsigned int distance = get_properties_byte() + 1; 964 uint8_t pos = 0; 965 uint8_t delta[256]; 966 967 memset(delta, 0, sizeof(delta)); 968 969 while (1) { 970 const int byte = read_byte(); 971 if (byte == EOF) 972 break; 973 974 uint8_t tmp = delta[(uint8_t)(distance + pos)]; 975 if (is_encoder) { 976 tmp = (uint8_t)(byte) - tmp; 977 delta[pos] = (uint8_t)(byte); 978 } else { 979 tmp = (uint8_t)(byte) + tmp; 980 delta[pos] = tmp; 981 } 982 983 write_byte(tmp); 984 --pos; 985 } 986 987 9885.4. Custom Filter IDs 989 990 If a developer wants to use custom Filter IDs, he has two 991 choices. The first choice is to contact Lasse Collin and ask 992 him to allocate a range of IDs for the developer. 993 994 The second choice is to generate a 40-bit random integer, 995 which the developer can use as his personal Developer ID. 996 To minimize the risk of collisions, Developer ID has to be 997 a randomly generated integer, not manually selected "hex word". 998 The following command, which works on many free operating 999 systems, can be used to generate Developer ID: 1000 1001 dd if=/dev/urandom bs=5 count=1 | hexdump 1002 1003 The developer can then use his Developer ID to create unique 1004 (well, hopefully unique) Filter IDs. 1005 1006 Bits Mask Description 1007 0-15 0x0000_0000_0000_FFFF Filter ID 1008 16-55 0x00FF_FFFF_FFFF_0000 Developer ID 1009 56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F 1010 1011 The resulting 63-bit integer will use 9 bytes of space when 1012 stored using the encoding described in Section 1.2. To get 1013 a shorter ID, see the beginning of this Section how to 1014 request a custom ID range. 1015 1016 10175.4.1. Reserved Custom Filter ID Ranges 1018 1019 Range Description 1020 0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility 1021 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility 1022 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility 1023 1024 10256. Cyclic Redundancy Checks 1026 1027 There are several incompatible variations to calculate CRC32 1028 and CRC64. For simplicity and clarity, complete examples are 1029 provided to calculate the checks as they are used in this file 1030 format. Implementations MAY use different code as long as it 1031 gives identical results. 1032 1033 The program below reads data from standard input, calculates 1034 the CRC32 and CRC64 values, and prints the calculated values 1035 as big endian hexadecimal strings to standard output. 1036 1037 #include <stddef.h> 1038 #include <inttypes.h> 1039 #include <stdio.h> 1040 1041 uint32_t crc32_table[256]; 1042 uint64_t crc64_table[256]; 1043 1044 void 1045 init(void) 1046 { 1047 static const uint32_t poly32 = UINT32_C(0xEDB88320); 1048 static const uint64_t poly64 1049 = UINT64_C(0xC96C5795D7870F42); 1050 1051 for (size_t i = 0; i < 256; ++i) { 1052 uint32_t crc32 = i; 1053 uint64_t crc64 = i; 1054 1055 for (size_t j = 0; j < 8; ++j) { 1056 if (crc32 & 1) 1057 crc32 = (crc32 >> 1) ^ poly32; 1058 else 1059 crc32 >>= 1; 1060 1061 if (crc64 & 1) 1062 crc64 = (crc64 >> 1) ^ poly64; 1063 else 1064 crc64 >>= 1; 1065 } 1066 1067 crc32_table[i] = crc32; 1068 crc64_table[i] = crc64; 1069 } 1070 } 1071 1072 uint32_t 1073 crc32(const uint8_t *buf, size_t size, uint32_t crc) 1074 { 1075 crc = ~crc; 1076 for (size_t i = 0; i < size; ++i) 1077 crc = crc32_table[buf[i] ^ (crc & 0xFF)] 1078 ^ (crc >> 8); 1079 return ~crc; 1080 } 1081 1082 uint64_t 1083 crc64(const uint8_t *buf, size_t size, uint64_t crc) 1084 { 1085 crc = ~crc; 1086 for (size_t i = 0; i < size; ++i) 1087 crc = crc64_table[buf[i] ^ (crc & 0xFF)] 1088 ^ (crc >> 8); 1089 return ~crc; 1090 } 1091 1092 int 1093 main() 1094 { 1095 init(); 1096 1097 uint32_t value32 = 0; 1098 uint64_t value64 = 0; 1099 uint64_t total_size = 0; 1100 uint8_t buf[8192]; 1101 1102 while (1) { 1103 const size_t buf_size 1104 = fread(buf, 1, sizeof(buf), stdin); 1105 if (buf_size == 0) 1106 break; 1107 1108 total_size += buf_size; 1109 value32 = crc32(buf, buf_size, value32); 1110 value64 = crc64(buf, buf_size, value64); 1111 } 1112 1113 printf("Bytes: %" PRIu64 "\n", total_size); 1114 printf("CRC-32: 0x%08" PRIX32 "\n", value32); 1115 printf("CRC-64: 0x%016" PRIX64 "\n", value64); 1116 1117 return 0; 1118 } 1119 1120 11217. References 1122 1123 LZMA SDK - The original LZMA implementation 1124 http://7-zip.org/sdk.html 1125 1126 LZMA Utils - LZMA adapted to POSIX-like systems 1127 http://tukaani.org/lzma/ 1128 1129 XZ Utils - The next generation of LZMA Utils 1130 http://tukaani.org/xz/ 1131 1132 [RFC-1952] 1133 GZIP file format specification version 4.3 1134 http://www.ietf.org/rfc/rfc1952.txt 1135 - Notation of byte boxes in section "2.1. Overall conventions" 1136 1137 [RFC-2119] 1138 Key words for use in RFCs to Indicate Requirement Levels 1139 http://www.ietf.org/rfc/rfc2119.txt 1140 1141 [GNU-tar] 1142 GNU tar 1.21 manual 1143 http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html 1144 - Node 9.4.2 "Blocking Factor", paragraph that begins 1145 "gzip will complain about trailing garbage" 1146 - Note that this URL points to the latest version of the 1147 manual, and may some day not contain the note which is in 1148 1.21. For the exact version of the manual, download GNU 1149 tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz 1150 1151