.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v1.0.6
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkqstvzVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding. Compression is
generally considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.

The command-line options are deliberately very similar to
those of
.I GNU gzip,
but they are not identical.

.I bzip2
expects a list of file names to accompany the
command-line flags. Each file is replaced by a compressed version of
itself, with the name "original_name.bz2".
Each compressed file
has the same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties can
be correctly restored at decompression time.
File name handling is
naive in the sense that there is no mechanism for preserving original
file names, permissions, ownerships or dates in filesystems which lack
these concepts, or have serious file name length restrictions, such as
MS-DOS.

.I bzip2
and
.I bunzip2
will by default not overwrite existing
files. If you want this to happen, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard
input to standard output. In this case,
.I bzip2
will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

.I bunzip2
(or
.I bzip2 \-d)
decompresses all
specified files. Files which were not created by
.I bzip2
will be detected and ignored, and a warning issued.
.I bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings,
.I .bz2,
.I .bz,
.I .tbz2
or
.I .tbz,
.I bzip2
complains that it cannot
guess the name of the original file, and uses the original name
with
.I .out
appended.
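
The naming rules above can be sketched in a few lines of Python.
This is only an illustration of the suffix table, not the code used by
.I bzip2
itself; the names SUFFIX_MAP and guess_output_name are hypothetical.

.nf
```python
# Hypothetical helper that mirrors the suffix table in the text above.
# It is an illustration only, not bzip2's own implementation.
SUFFIX_MAP = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name):
    """Return the decompressed file name implied by the rules above."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            # Strip the recognised ending and add its replacement, if any.
            return name[: len(name) - len(suffix)] + replacement
    # Unrecognised ending: keep the original name and append .out
    return name + ".out"
```
.fi

Under these assumptions, guess_output_name applied to "archive.tbz2"
gives "archive.tar", and an unrecognised name such as "notes.txt"
gives "notes.txt.out".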

As with compression, supplying no
filenames causes decompression from
standard input to standard output.

.I bunzip2
will correctly decompress a file which is the
concatenation of two or more compressed files. The result is the
concatenation of the corresponding uncompressed files. Integrity
testing (\-t)
of concatenated
compressed files is also supported.

You can also compress or decompress files to the standard output by
giving the \-c flag. Multiple files may be compressed and
decompressed like this. The resulting outputs are fed sequentially to
stdout. Compression of multiple files
in this manner generates a stream
containing multiple compressed file representations. Such a stream
can be decompressed correctly only by
.I bzip2
version 0.9.0 or
later. Earlier versions of
.I bzip2
will stop after decompressing
the first file in the stream.

.I bzcat
(or
.I bzip2 \-dc)
decompresses all specified files to
the standard output.

.I bzip2
will read arguments from the environment variables
.I BZIP2
and
.I BZIP,
in that order, and will process them
before any arguments read from the command line. This gives a
convenient way to supply default arguments.

Compression is always performed, even if the compressed
file is slightly
larger than the original.
Files of less than about one hundred bytes
tend to get larger, since the compression mechanism has a constant
overhead in the region of 50 bytes. Random data (including the output
of most file compressors) is coded at about 8.05 bits per byte, giving
an expansion of around 0.5%.

As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to the
original. This guards against corruption of the compressed data, and
against undetected bugs in
.I bzip2
(hopefully very unlikely). The
chance of data corruption going undetected is microscopic, about one
chance in four billion for each file processed. Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong. It can't help you
recover the original uncompressed
data. You can use
.I bzip2recover
to try to recover data from
damaged files.

Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, 3 for an internal consistency error (e.g. a bug) which
caused
.I bzip2
to panic.

.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output.
.TP
.B \-d --decompress
Force decompression.
.I bzip2,
.I bunzip2
and
.I bzcat
are
really the same program, and the decision about what actions to take is
done on the basis of which name is used. This flag overrides that
mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z --compress
The complement to \-d: forces compression, regardless of the
invocation name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f --force
Force overwrite of output files. Normally,
.I bzip2
will not overwrite
existing output files. Also forces
.I bzip2
to break hard links
to files, which it otherwise wouldn't do.

bzip2 normally declines to decompress files which don't have the
correct magic header bytes. If forced (\-f), however, it will pass
such files through unmodified. This is how GNU gzip behaves.
.TP
.B \-k --keep
Keep (don't delete) input files during compression
or decompression.
.TP
.B \-s --small
Reduce memory usage, for compression, decompression and testing. Files
are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte. This means any file can be
decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your compression
ratio. In short, if your machine is low on memory (8 megabytes or
less), use \-s for everything. See MEMORY MANAGEMENT below.
.TP
.B \-q --quiet
Suppress non-essential warning messages. Messages pertaining to
I/O errors and other critical events will not be suppressed.
.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L --license -V --version
Display the software version, license terms and conditions.
.TP
.B \-1 (or \-\-fast) to \-9 (or \-\-best)
Set the block size to 100 k, 200 k .. 900 k when compressing. Has no
effect when decompressing. See MEMORY MANAGEMENT below.
The \-\-fast and \-\-best aliases are primarily for GNU gzip
compatibility. In particular, \-\-fast doesn't make things
significantly faster.
And \-\-best merely selects the default behaviour.
.TP
.B \--
Treats all subsequent arguments as file names, even if they start
with a dash. This is so you can handle files with names beginning
with a dash, for example: bzip2 \-- \-myfilename.
.TP
.B \--repetitive-fast --repetitive-best
These flags are redundant in versions 0.9.5 and above.
They provided
some coarse control over the behaviour of the sorting algorithm in
earlier versions, which was sometimes useful. 0.9.5 and above have an
improved algorithm which renders these flags irrelevant.

.SH MEMORY MANAGEMENT
.I bzip2
compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory needed for
compression and decompression. The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively. At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress
the file. Since block sizes are stored in compressed files, it follows
that the flags \-1 to \-9 are irrelevant to and so ignored
during decompression.

Compression and decompression requirements,
in bytes, can be estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size )

Larger block sizes give rapidly diminishing marginal returns. Most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using
.I bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.
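
The estimates above can be turned into a small calculation.
The following Python sketch is illustrative only: the function name
estimate_memory is hypothetical, "k" is read as 1000 bytes (an
assumption), and the small argument models only the 2.5 x decompression
variant, not the 200k block size that \-s selects during compression.

.nf
```python
def estimate_memory(level, small=False):
    """Estimate (compress, decompress) memory in bytes for flags -1 to -9.

    Applies the formulas quoted in the text; treating k as 1000 bytes
    is an assumption of this sketch.
    """
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    k = 1000
    block_size = level * 100000  # 100,000 bytes per level
    compress = 400 * k + 8 * block_size
    # Normal decompression needs 4 bytes per block byte; the reduced
    # memory algorithm needs 2.5 bytes per block byte.
    factor = 2.5 if small else 4
    decompress = 100 * k + int(factor * block_size)
    return compress, decompress
```
.fi

For level 9 this gives 7,600,000 bytes for compression and 3,700,000
bytes for decompression, or 2,350,000 bytes for decompression with the
reduced-memory algorithm.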

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress. To support decompression
of any file on a 4 megabyte machine,
.I bunzip2
has an option to
decompress using approximately half this amount of memory, about 2300
kbytes. Decompression speed is also halved, so you should use this
option only where necessary. The relevant flag is \-s.

In general, try and use the largest block size memory constraints allow,
since that maximises the compression achieved. Compression and
decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size. The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block. For example, compressing a file
20,000 bytes long with the flag \-9 will cause the compressor to
allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
kbytes of it. Similarly, the decompressor will allocate 3700k but only
touch 100k + 20000 * 4 = 180 kbytes.

Here is a table which summarises the maximum memory usage for different
block sizes. Also recorded is the total compressed size for 14 files of
the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes for
larger files, since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6100k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900kbytes long. Each
block is handled independently. If a media or transmission error causes
a multi-block .bz2
file to become damaged, it may be possible to
recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty. Each block also carries its own 32-bit CRC, so
damaged blocks can be distinguished from undamaged ones.

.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file. You can then use
.I bzip2
\-t
to test the
integrity of the resulting files, and decompress those which are
undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc, containing the extracted blocks.
The output filenames are designed so that the use of
wildcards in subsequent processing -- for example,
"bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in
the correct order.

.I bzip2recover
should be of most use dealing with large .bz2
files, as these will contain many blocks. It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered. If you wish to minimise
any potential data loss through media or transmission errors,
you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in the
file. Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..." (repeated several hundred times) may
compress more slowly than normal. Versions 0.9.5 and above fare much
better than previous versions in this respect. The ratio between
worst-case and average-case compression time is in the region of 10:1.
For previous versions, this figure was more like 100:1. You can use the
\-vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

.I bzip2
usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion. This means
that performance, both for compressing and decompressing, is largely
determined by the speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the miss rate have
been observed to give disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.6 of
.I bzip2.
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the following
exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files. 0.1pl2 cannot do this; it will stop
after decompressing just the first file in the stream.

.I bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent
bit positions in compressed files, so they could not handle compressed
files more than 512 megabytes long. Versions 1.0.2 and above use
64-bit ints on some platforms which support them (GNU supported
targets, and Windows). To establish whether or not bzip2recover was
built with such a limitation, run it without arguments. In any event
you can build yourself an unlimited version if you can recompile it
with MaybeUInt64 set to be an unsigned 64-bit integer.

.SH AUTHOR
Julian Seward, jseward@bzip.org.

http://www.bzip.org

The ideas embodied in
.I bzip2
are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder), Peter
Fenwick (for the structured coding model in the original
.I bzip,
and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder in the original
.I bzip).
I am much
indebted for their help, support and advice. See the manual in the
source distribution for pointers to sources of documentation. Christian
von Roques encouraged me to look for faster sorting algorithms, so as to
speed up compression. Bela Lubkin encouraged me to improve the
worst-case compression performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped
with portability problems, lent machines, gave advice and were generally
helpful.