A Fast Method for Identifying Plain Text Files
==============================================


Introduction
------------

Given a file coming from an unknown source, it is sometimes desirable
to find out whether the format of that file is plain text.  Although
this may appear to be a simple task, a fully accurate detection of the
file type requires heavy-duty semantic analysis of the file contents.
It is, however, possible to obtain satisfactory results by employing
various heuristics.

Previous versions of PKZip and other zip-compatible compression tools
used a crude detection scheme: if more than 80% (4/5) of the bytes
found in a certain buffer are within the range [7..127], the file is
labeled as plain text; otherwise it is labeled as binary.  A prominent
limitation of this scheme is the restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes within the range [128..255], and texts using these alphabets
are often misidentified as binary by this scheme; in other words, the
rate of false negatives is too high, which means that the recall is
low.  Another weakness of this scheme is a reduced precision, due to
the false positives that may occur when binary files containing large
amounts of textual characters are misidentified as plain text.
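
Both weaknesses of the older scheme can be illustrated with a short
Python sketch.  It only restates the 80% rule described above; it is
not taken from PKZip or any other tool.  The function name
legacy_is_text is hypothetical, the whole input stands in for the
"certain buffer", and the sample inputs are made up for the example.

```python
def legacy_is_text(buffer: bytes) -> bool:
    """Older rule: more than 4/5 of the bytes lie in the range [7..127]."""
    in_range = sum(1 for byte in buffer if 7 <= byte <= 127)
    # Integer form of in_range / len(buffer) > 0.8; an empty buffer gives False.
    return 5 * in_range > 4 * len(buffer)


# Made-up samples for illustration.
english = b"Hello, world"
# "Privet, mir" in Cyrillic; KOI8-R stores each Cyrillic letter as a byte >= 128.
russian = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440".encode("koi8_r")
# Two control bytes followed by a long run of printable characters.
mostly_text_binary = b"\x00\x01" + b"A" * 98

for name, sample in (("english", english),
                     ("russian", russian),
                     ("mostly_text_binary", mostly_text_binary)):
    print(name, legacy_is_text(sample))
```

Under these assumptions the Cyrillic sample is rejected (a false
negative), while the buffer that starts with NUL bytes is accepted (a
false positive).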

In this article we propose a new, simple detection scheme that features
a much higher precision and a near-100% recall.  This scheme is
designed to work on ASCII, Unicode and other ASCII-derived alphabets,
and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
and variable-sized encodings (ISO-2022, UTF-8, etc.).  Wider encodings
(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.


The Algorithm
-------------

The algorithm works by dividing the set of bytecodes [0..255] into three
categories:
- The white list of textual bytecodes:
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
- The gray list of tolerated bytecodes:
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
- The black list of undesired, non-textual bytecodes:
  0 (NUL) to 6, 14 to 25, 28 to 31.

If a file contains at least one byte that belongs to the white list and
no byte that belongs to the black list, then the file is categorized as
plain text; otherwise, it is categorized as binary.  (The boundary
cases, an empty file or a file consisting only of gray-listed bytes,
automatically fall into the latter category.)
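
The rule above can be sketched in Python as follows.  This is a minimal
illustration, not the code used by any particular compression tool.
The names WHITE, GRAY, BLACK and is_plain_text are hypothetical, and
the black list is taken to be every byte value that appears on neither
the white list nor the gray list.

```python
# Byte classes from the three lists above (names are illustrative).
WHITE = frozenset({9, 10, 13}) | frozenset(range(32, 256))
GRAY = frozenset({7, 8, 11, 12, 26, 27})
BLACK = frozenset(range(256)) - WHITE - GRAY


def is_plain_text(data: bytes) -> bool:
    """True if data has at least one white-listed byte and no black-listed byte."""
    seen_white = False
    for byte in data:
        if byte in BLACK:
            return False  # a single black-listed byte decides "binary"
        if byte in WHITE:
            seen_white = True
    # Empty input and gray-only input never set seen_white, so they are binary.
    return seen_white


if __name__ == "__main__":
    samples = (b"hello\r\n", b"", b"\x07\x1b", b"abc\x00def", b"\x1b[1mbold\x1b[0m")
    for sample in samples:
        print(sample, is_plain_text(sample))
```

Because the decision depends only on which byte values occur, the loop
can stop at the first black-listed byte, and no counters or thresholds
are needed.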


Rationale
---------

The idea behind this algorithm relies on two observations.

The first observation is that, although the full range of 7-bit codes
[0..127] is properly specified by the ASCII standard, most control
characters in the range [0..31] are not used in practice.  The only
widely used, almost universally portable control codes are 9 (TAB),
10 (LF) and 13 (CR).  There are a few more control codes that are
recognized on a reduced range of platforms and text viewers/editors:
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
codes are rarely (if ever) used alone, without being accompanied by
some printable text.  Even the newer, portable text formats such as
XML avoid using control characters outside the list mentioned here.

The second observation is that most binary files tend to contain
control characters, especially 0 (NUL).  Even though the older text
detection schemes observe the presence of non-ASCII codes from the range
[128..255], the precision rarely has to suffer if this upper range is
labeled as textual, because the files that are genuinely binary tend to
contain both control characters and codes from the upper range.  On the
other hand, the upper range needs to be labeled as textual, because it
is used by virtually all ASCII extensions.  In particular, this range is
used for encoding non-Latin scripts.

Since there is no counting involved, other than simply observing the
presence or the absence of some byte values, the algorithm produces
consistent results, regardless of what alphabet encoding is being used.
(If counting were involved, it could be possible to obtain different
results on a text encoded, say, using ISO-8859-16 versus UTF-8.)
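
The effect of counting can be made concrete with a small Python sketch.
It compares an 80% counting rule with a presence-only rule on the same
short text in two encodings.  The function names count_based and
presence_based are hypothetical, and the sample phrase was picked only
because its share of ASCII bytes falls on opposite sides of the 80%
threshold in the two encodings.

```python
GRAY = {7, 8, 11, 12, 26, 27}


def is_white(byte: int) -> bool:
    """White list: TAB, LF, CR and everything from SPACE upward."""
    return byte in (9, 10, 13) or byte >= 32


def count_based(data: bytes) -> bool:
    """Counting rule: more than 4/5 of the bytes lie in [7..127]."""
    in_range = sum(1 for byte in data if 7 <= byte <= 127)
    return 5 * in_range > 4 * len(data)


def presence_based(data: bytes) -> bool:
    """Presence rule: some white-listed byte and no black-listed byte."""
    has_white = any(is_white(byte) for byte in data)
    has_black = any(not is_white(byte) and byte not in GRAY for byte in data)
    return has_white and not has_black


text = "\u0219i apoi"                # Romanian "si apoi", first letter s-comma-below
latin = text.encode("iso8859_16")    # 7 bytes, one of them >= 128
utf8 = text.encode("utf-8")          # 8 bytes, two of them >= 128

for name, data in (("ISO-8859-16", latin), ("UTF-8", utf8)):
    print(name, count_based(data), presence_based(data))
```

In this sketch the counting rule accepts the ISO-8859-16 bytes (6 of 7
bytes in range) but rejects the UTF-8 bytes (6 of 8), whereas the
presence rule gives the same answer for both.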

There is an extra category of plain text files that are "polluted" with
one or more black-listed codes, either by mistake or by peculiar design
considerations.  In such cases, a scheme that tolerates a small fraction
of black-listed codes would provide an increased recall (i.e. more true
positives).  This, however, comes at the cost of a reduced precision
overall, since false positives are more likely to appear in binary files
that contain large chunks of textual data.  Furthermore, "polluted"
plain text should be regarded as binary by general-purpose text
detection schemes, because general-purpose text processing algorithms
might not be applicable.  Under this premise, it is safe to say that our
detection method provides a near-100% recall.
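
The trade-off described in this paragraph can be sketched in Python.
The tolerant variant below is hypothetical and is not part of the
proposed method: its name, its 1% threshold and the two sample inputs
are assumptions chosen only to show how tolerating black-listed bytes
turns some binary files into false positives.

```python
GRAY = {7, 8, 11, 12, 26, 27}


def is_white(byte: int) -> bool:
    return byte in (9, 10, 13) or byte >= 32


def is_black(byte: int) -> bool:
    """Control codes that are on neither the white list nor the gray list."""
    return not is_white(byte) and byte not in GRAY


def is_text_strict(data: bytes) -> bool:
    """The rule from this article: no black-listed byte is tolerated."""
    return any(is_white(b) for b in data) and not any(is_black(b) for b in data)


def is_text_tolerant(data: bytes, max_black_fraction: float = 0.01) -> bool:
    """Hypothetical variant accepting a small share of black-listed bytes."""
    black = sum(1 for b in data if is_black(b))
    return any(is_white(b) for b in data) and black <= max_black_fraction * len(data)


# Made-up samples: text with one stray NUL, and binary data with a text payload.
polluted_text = b"log line\n" * 50 + b"\x00"
binary_with_text = b"\x00\x01\x02" + b"readable text " * 100

for name, data in (("polluted_text", polluted_text),
                   ("binary_with_text", binary_with_text)):
    print(name, is_text_strict(data), is_text_tolerant(data))
```

With these inputs the tolerant variant accepts the polluted text, but
it also accepts the binary sample, while the strict rule labels both as
binary.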

Experiments have been run on many files coming from various platforms
and applications.  We tried plain text files, system logs, source code,
formatted office documents, compiled object code, etc.  The results
confirm the optimistic assumptions about the capabilities of this
algorithm.


--
Cosmin Truta
Last updated: 2006-May-28