.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v1.0.6
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkqstvzVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding.  Compression is
generally considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.

The command-line options are deliberately very similar to
those of
.I GNU gzip,
but they are not identical.

.I bzip2
expects a list of file names to accompany the
command-line flags.  Each file is replaced by a compressed version of
itself, with the name "original_name.bz2".
Each compressed file
has the same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties can
be correctly restored at decompression time.  File name handling is
naive in the sense that there is no mechanism for preserving original
file names, permissions, ownerships or dates in filesystems which lack
these concepts, or have serious file name length restrictions, such as
MS-DOS.

.I bzip2
and
.I bunzip2
will by default not overwrite existing
files.  If you want this to happen, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard
input to standard output.  In this case,
.I bzip2
will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

.I bunzip2
(or
.I bzip2 \-d)
decompresses all
specified files.  Files which were not created by
.I bzip2
will be detected and ignored, and a warning issued.
.I bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings,
.I .bz2,
.I .bz,
.I .tbz2
or
.I .tbz,
.I bzip2
complains that it cannot
guess the name of the original file, and uses the original name
with
.I .out
appended.
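.PP
The naming rules above can be sketched in a few lines of Python.
This is only an illustration of the table, not the code used by
.I bzip2
itself; the names SUFFIX_RULES and guess_output_name are hypothetical.
.nf
```python
# Suffix rules copied from the table above, checked in this order.
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(compressed_name):
    """Return the output name implied by the table (illustrative sketch)."""
    for suffix, replacement in SUFFIX_RULES:
        if compressed_name.endswith(suffix):
            return compressed_name[: -len(suffix)] + replacement
    # Unrecognised ending: keep the whole name and append ".out".
    return compressed_name + ".out"


print(guess_output_name("filename.tbz2"))
```
.fi
.PP
Because every rule includes the leading dot, a name ending in .tbz2 is
never mistaken for one ending in .bz2.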

As with compression, supplying no
filenames causes decompression from
standard input to standard output.

.I bunzip2
will correctly decompress a file which is the
concatenation of two or more compressed files.  The result is the
concatenation of the corresponding uncompressed files.  Integrity
testing (\-t)
of concatenated
compressed files is also supported.
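.PP
The concatenation property can be observed with Python's standard bz2
module, used here as a stand-in for the command-line tools.
The sketch assumes Python 3.3 or later, where bz2.decompress accepts
multi-stream input.
.nf
```python
import bz2

first = bz2.compress(b"alpha ")
second = bz2.compress(b"beta")

# Byte-wise concatenation of two complete compressed streams.
combined = first + second

# Decompressing the concatenation yields the concatenated originals.
restored = bz2.decompress(combined)
print(restored)
```
.fi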

You can also compress or decompress files to the standard output by
giving the \-c flag.  Multiple files may be compressed and
decompressed like this.  The resulting outputs are fed sequentially to
stdout.  Compression of multiple files
in this manner generates a stream
containing multiple compressed file representations.  Such a stream
can be decompressed correctly only by
.I bzip2
version 0.9.0 or
later.  Earlier versions of
.I bzip2
will stop after decompressing
the first file in the stream.

.I bzcat
(or
.I bzip2 \-dc)
decompresses all specified files to
the standard output.

.I bzip2
will read arguments from the environment variables
.I BZIP2
and
.I BZIP,
in that order, and will process them
before any arguments read from the command line.  This gives a
convenient way to supply default arguments.
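.PP
The processing order described above can be modelled as follows.
This is a rough sketch, not the parser used by
.I bzip2;
the function name effective_arguments is hypothetical, and splitting the
variables on whitespace is an assumption.
.nf
```python
def effective_arguments(environ, command_line_args):
    """Arguments in processing order: BZIP2, then BZIP, then the command line."""
    args = []
    for variable in ("BZIP2", "BZIP"):
        # Assumption: the value of each variable is split on whitespace.
        args.extend(environ.get(variable, "").split())
    args.extend(command_line_args)
    return args


print(effective_arguments({"BZIP2": "-k -v"}, ["-9", "data.txt"]))
```
.fi
.PP
In this model a flag given on the command line is seen after any flag
taken from the environment.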

Compression is always performed, even if the compressed
file is slightly
larger than the original.  Files of less than about one hundred bytes
tend to get larger, since the compression mechanism has a constant
overhead in the region of 50 bytes.  Random data (including the output
of most file compressors) is coded at about 8.05 bits per byte, giving
an expansion of around 0.5%.
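.PP
Both effects can be measured with Python's standard bz2 module.
The sketch below is illustrative only: the sample inputs are arbitrary,
and seeded pseudo-random bytes are assumed to be a fair stand-in for
incompressible data.
.nf
```python
import bz2
import random

# A tiny input: the fixed overhead dominates, so the output is larger.
small = b"hello, world"
small_compressed = bz2.compress(small)

# Seeded pseudo-random bytes stand in for incompressible data.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(200_000))
noise_compressed = bz2.compress(noise)

print(len(small), "->", len(small_compressed))
print("expansion factor: %.4f" % (len(noise_compressed) / len(noise)))
```
.fi
.PP
An expansion factor slightly above 1 corresponds to the small
percentage growth quoted above.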

As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to the
original.  This guards against corruption of the compressed data, and
against undetected bugs in
.I bzip2
(hopefully very unlikely).  The
chance of data corruption going undetected is microscopic, about one
in four billion for each file processed.  Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong.  It can't help you
recover the original uncompressed
data.  You can use
.I bzip2recover
to try to recover data from
damaged files.
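.PP
The following sketch shows the kind of damage the check is meant to
catch, using Python's standard bz2 module rather than the
.I bzip2
program.
The helper name is_intact is hypothetical; it performs a trial
decompression and reports whether any error was raised.
.nf
```python
import bz2

original = b"The quick brown fox jumps over the lazy dog. " * 200
damaged = bytearray(bz2.compress(original))

# Simulate a media error by inverting one byte in the middle of the stream.
damaged[len(damaged) // 2] ^= 0xFF


def is_intact(blob):
    """Trial decompression: True if the data decompresses without error."""
    try:
        bz2.decompress(bytes(blob))
    except (OSError, ValueError, EOFError):
        return False
    return True


print(is_intact(bz2.compress(original)), is_intact(damaged))
```
.fi
.PP
As the text says, a failed check only reports that something is wrong;
it does not say where, and it cannot repair the data.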

Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, 3 for an internal consistency error (e.g. a bug) which
caused
.I bzip2
to panic.

.SH OPTIONS
.TP
.B \-c \-\-stdout
Compress or decompress to standard output.
.TP
.B \-d \-\-decompress
Force decompression.
.I bzip2,
.I bunzip2
and
.I bzcat
are
really the same program, and the decision about what action to take is
made on the basis of which name is used.  This flag overrides that
mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z \-\-compress
The complement to \-d: forces compression, regardless of the
invocation name.
.TP
.B \-t \-\-test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f \-\-force
Force overwrite of output files.  Normally,
.I bzip2
will not overwrite
existing output files.  Also forces
.I bzip2
to break hard links
to files, which it otherwise wouldn't do.

.I bzip2
normally declines to decompress files which don't have the
correct magic header bytes.  If forced (\-f), however, it will pass
such files through unmodified.  This is how GNU gzip behaves.
.TP
.B \-k \-\-keep
Keep (don't delete) input files during compression
or decompression.
.TP
.B \-s \-\-small
Reduce memory usage, for compression, decompression and testing.  Files
are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte.  This means any file can be
decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your compression
ratio.  In short, if your machine is low on memory (8 megabytes or
less), use \-s for everything.  See MEMORY MANAGEMENT below.
.TP
.B \-q \-\-quiet
Suppress non-essential warning messages.  Messages pertaining to
I/O errors and other critical events will not be suppressed.
.TP
.B \-v \-\-verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L \-\-license \-V \-\-version
Display the software version, license terms and conditions.
.TP
.B \-1 (or \-\-fast) to \-9 (or \-\-best)
Set the block size to 100 k, 200 k ... 900 k when compressing.  Has no
effect when decompressing.  See MEMORY MANAGEMENT below.
The \-\-fast and \-\-best aliases are primarily for GNU gzip
compatibility.  In particular, \-\-fast doesn't make things
significantly faster.
And \-\-best merely selects the default behaviour.
.TP
.B \-\-
Treats all subsequent arguments as file names, even if they start
with a dash.  This is so you can handle files with names beginning
with a dash, for example: bzip2 \-\- \-myfilename.
.TP
.B \-\-repetitive-fast \-\-repetitive-best
These flags are redundant in versions 0.9.5 and above.  They provided
some coarse control over the behaviour of the sorting algorithm in
earlier versions, which was sometimes useful.  0.9.5 and above have an
improved algorithm which renders these flags irrelevant.

.SH MEMORY MANAGEMENT
.I bzip2
compresses large files in blocks.  The block size affects
both the compression ratio achieved, and the amount of memory needed for
compression and decompression.  The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively.  At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress
the file.  Since block sizes are stored in compressed files, it follows
that the flags \-1 to \-9 are irrelevant to and so ignored
during decompression.
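.PP
As an illustration of the block size being recorded in the file, the
sketch below reads it back from data produced by Python's standard bz2
module.
It assumes the usual stream header layout, the three bytes "BZh"
followed by an ASCII digit from 1 to 9, which this page does not itself
describe; the function name block_size_from_header is hypothetical.
.nf
```python
import bz2


def block_size_from_header(blob):
    """Return the block size in bytes recorded in a .bz2 stream header."""
    # Assumed layout: the bytes "BZh" followed by an ASCII digit 1-9.
    if len(blob) < 4 or blob[:3] != b"BZh" or not (0x31 <= blob[3] <= 0x39):
        raise ValueError("not a bzip2 stream header")
    return (blob[3] - 0x30) * 100_000


for level in (1, 5, 9):
    print(level, block_size_from_header(bz2.compress(b"example", level)))
```
.fi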

Compression and decompression requirements,
in bytes, can be estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size )

Larger block sizes give rapidly diminishing marginal returns.  Most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using
.I bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.
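.PP
The estimates above are easy to tabulate.
The following sketch simply evaluates the two formulas for a given flag;
the function name memory_estimate_kbytes is hypothetical, and the
.I small
parameter models only the reduced-memory decompression formula.
.nf
```python
def memory_estimate_kbytes(level, small=False):
    """Estimated (compression, decompression) memory in kbytes for flags 1 to 9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    block_k = 100 * level                  # block size in kbytes
    compress_k = 400 + 8 * block_k
    decompress_k = 100 + (2.5 if small else 4) * block_k
    return compress_k, decompress_k


print(memory_estimate_kbytes(9))
print(memory_estimate_kbytes(9, small=True))
```
.fi
.PP
For the default 900k block size this gives 7600k for compression and
3700k for decompression, or 2350k with the reduced-memory formula.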

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress.  To support decompression
of any file on a 4 megabyte machine,
.I bunzip2
has an option to
decompress using approximately half this amount of memory, about 2300
kbytes.  Decompression speed is also halved, so you should use this
option only where necessary.  The relevant flag is \-s.

In general, try to use the largest block size memory constraints allow,
since that maximises the compression achieved.  Compression and
decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size.  The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block.  For example, compressing a file
20,000 bytes long with the flag \-9 will cause the compressor to
allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
kbytes of it.  Similarly, the decompressor will allocate 3700k but only
touch 100k + 20000 * 4 = 180 kbytes.
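.PP
The arithmetic in this example can be checked mechanically.
The sketch below treats "k" as 1000 bytes, as the worked figures do, and
caps the amount of block memory in use at the file size; the function
name touched_bytes is hypothetical.
.nf
```python
K = 1000  # the worked example treats "k" as 1000 bytes


def touched_bytes(file_size, level, decompress=False):
    """Memory actually touched when the input may be smaller than one block."""
    block_size = 100 * K * level
    used = min(file_size, block_size)      # only this much of a block is used
    if decompress:
        return 100 * K + 4 * used
    return 400 * K + 8 * used


print(touched_bytes(20_000, 9))
print(touched_bytes(20_000, 9, decompress=True))
```
.fi
.PP
For a file at least as large as one block, the result is the same as
the full estimate for that block size.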

Here is a table which summarises the maximum memory usage for different
block sizes.  Also recorded is the total compressed size for 14 files of
the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes for
larger files, since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6100k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900 kbytes long.  Each
block is handled independently.  If a media or transmission error causes
a multi-block .bz2
file to become damaged, it may be possible to
recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty.  Each block also carries its own 32-bit CRC, so
damaged blocks can be distinguished from undamaged ones.
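.PP
The search for block boundaries can be sketched as a bit-level scan.
The example below is not the code of
.I bzip2recover;
it assumes the block pattern is the 48-bit value 0x314159265359 (taken
from the bzip2 stream format, not from this page) and that it may occur
at any bit offset.
The function name find_block_starts is hypothetical.
.nf
```python
import bz2

# Assumption: each block starts with this 48-bit value, at any bit offset.
BLOCK_MAGIC = 0x314159265359
MASK_48 = (1 << 48) - 1


def find_block_starts(blob):
    """Return the bit offsets at which the 48-bit block pattern occurs."""
    window = 0
    bit_index = 0
    starts = []
    for byte in blob:
        for shift in range(7, -1, -1):
            # Shift the next bit (most significant first) into the window.
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bit_index += 1
            if bit_index >= 48 and window == BLOCK_MAGIC:
                starts.append(bit_index - 48)
    return starts


stream = bz2.compress(b"some sample text " * 100)
print(find_block_starts(stream))
```
.fi
.PP
A scan of this kind has to work on bits rather than bytes because only
the first block of a stream is guaranteed to begin on a byte boundary.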

.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file.  You can then use
.I bzip2
\-t
to test the
integrity of the resulting files, and decompress those which are
undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc, containing the extracted blocks.
The output filenames are designed so that the use of
wildcards in subsequent processing -- for example,
"bzip2 \-dc rec*file.bz2 > recovered_data" -- processes the files in
the correct order.

.I bzip2recover
should be of most use dealing with large .bz2
files, as these will contain many blocks.  It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered.  If you wish to minimise
any potential data loss through media or transmission errors,
you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in the
file.  Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..."  (repeated several hundred times) may
compress more slowly than normal.  Versions 0.9.5 and above fare much
better than previous versions in this respect.  The ratio between
worst-case and average-case compression time is in the region of 10:1.
For previous versions, this figure was more like 100:1.  You can use the
\-vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

.I bzip2
usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion.  This means
that performance, both for compressing and decompressing, is largely
determined by the speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the miss rate have
been observed to give disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.6 of
.I bzip2.
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the following
exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files.  0.1pl2 cannot do this; it will stop
after decompressing just the first file in the stream.

.I bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent
bit positions in compressed files, so they could not handle compressed
files more than 512 megabytes long.  Versions 1.0.2 and above use
64-bit ints on some platforms which support them (GNU supported
targets, and Windows).  To establish whether or not bzip2recover was
built with such a limitation, run it without arguments.  In any event
you can build yourself an unlimited version if you can recompile it
with MaybeUInt64 set to be an unsigned 64-bit integer.

.SH AUTHOR
Julian Seward, jseward@bzip.org.

http://www.bzip.org

The ideas embodied in
.I bzip2
are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder), Peter
Fenwick (for the structured coding model in the original
.I bzip,
and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder in the original
.I bzip).
I am much
indebted for their help, support and advice.  See the manual in the
source distribution for pointers to sources of documentation.  Christian
von Roques encouraged me to look for faster sorting algorithms, so as to
speed up compression.  Bela Lubkin encouraged me to improve the
worst-case compression performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped
with portability problems, lent machines, gave advice and were generally
helpful.