1205194SdelphijA Fast Method for Identifying Plain Text Files 2205194Sdelphij============================================== 3205194Sdelphij 4205194Sdelphij 5205194SdelphijIntroduction 6205194Sdelphij------------ 7205194Sdelphij 8205194SdelphijGiven a file coming from an unknown source, it is sometimes desirable 9205194Sdelphijto find out whether the format of that file is plain text. Although 10205194Sdelphijthis may appear like a simple task, a fully accurate detection of the 11205194Sdelphijfile type requires heavy-duty semantic analysis on the file contents. 12205194SdelphijIt is, however, possible to obtain satisfactory results by employing 13205194Sdelphijvarious heuristics. 14205194Sdelphij 15205194SdelphijPrevious versions of PKZip and other zip-compatible compression tools 16205194Sdelphijwere using a crude detection scheme: if more than 80% (4/5) of the bytes 17205194Sdelphijfound in a certain buffer are within the range [7..127], the file is 18205194Sdelphijlabeled as plain text, otherwise it is labeled as binary. A prominent 19205194Sdelphijlimitation of this scheme is the restriction to Latin-based alphabets. 20205194SdelphijOther alphabets, like Greek, Cyrillic or Asian, make extensive use of 21205194Sdelphijthe bytes within the range [128..255], and texts using these alphabets 22205194Sdelphijare most often misidentified by this scheme; in other words, the rate 23205194Sdelphijof false negatives is sometimes too high, which means that the recall 24205194Sdelphijis low. Another weakness of this scheme is a reduced precision, due to 25205194Sdelphijthe false positives that may occur when binary files containing large 26205194Sdelphijamounts of textual characters are misidentified as plain text. 27205194Sdelphij 28205194SdelphijIn this article we propose a new, simple detection scheme that features 29205194Sdelphija much increased precision and a near-100% recall. This scheme is 30205194Sdelphijdesigned to work on ASCII, Unicode and other ASCII-derived alphabets, 31205194Sdelphijand it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.) 32205194Sdelphijand variable-sized encodings (ISO-2022, UTF-8, etc.). Wider encodings 33205194Sdelphij(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however. 34205194Sdelphij 35205194Sdelphij 36205194SdelphijThe Algorithm 37205194Sdelphij------------- 38205194Sdelphij 39205194SdelphijThe algorithm works by dividing the set of bytecodes [0..255] into three 40205194Sdelphijcategories: 41205194Sdelphij- The white list of textual bytecodes: 42205194Sdelphij 9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255. 43205194Sdelphij- The gray list of tolerated bytecodes: 44205194Sdelphij 7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC). 45205194Sdelphij- The black list of undesired, non-textual bytecodes: 46205194Sdelphij 0 (NUL) to 6, 14 to 31. 47205194Sdelphij 48205194SdelphijIf a file contains at least one byte that belongs to the white list and 49205194Sdelphijno byte that belongs to the black list, then the file is categorized as 50205194Sdelphijplain text; otherwise, it is categorized as binary. (The boundary case, 51205194Sdelphijwhen the file is empty, automatically falls into the latter category.) 52205194Sdelphij 53205194Sdelphij 54205194SdelphijRationale 55205194Sdelphij--------- 56205194Sdelphij 57205194SdelphijThe idea behind this algorithm relies on two observations. 58205194Sdelphij 59205194SdelphijThe first observation is that, although the full range of 7-bit codes 60205194Sdelphij[0..127] is properly specified by the ASCII standard, most control 61205194Sdelphijcharacters in the range [0..31] are not used in practice. The only 62205194Sdelphijwidely-used, almost universally-portable control codes are 9 (TAB), 63205194Sdelphij10 (LF) and 13 (CR). There are a few more control codes that are 64205194Sdelphijrecognized on a reduced range of platforms and text viewers/editors: 65205194Sdelphij7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these 66205194Sdelphijcodes are rarely (if ever) used alone, without being accompanied by 67205194Sdelphijsome printable text. Even the newer, portable text formats such as 68205194SdelphijXML avoid using control characters outside the list mentioned here. 69205194Sdelphij 70205194SdelphijThe second observation is that most of the binary files tend to contain 71205194Sdelphijcontrol characters, especially 0 (NUL). Even though the older text 72205194Sdelphijdetection schemes observe the presence of non-ASCII codes from the range 73205194Sdelphij[128..255], the precision rarely has to suffer if this upper range is 74205194Sdelphijlabeled as textual, because the files that are genuinely binary tend to 75205194Sdelphijcontain both control characters and codes from the upper range. On the 76205194Sdelphijother hand, the upper range needs to be labeled as textual, because it 77205194Sdelphijis used by virtually all ASCII extensions. In particular, this range is 78205194Sdelphijused for encoding non-Latin scripts. 79205194Sdelphij 80205194SdelphijSince there is no counting involved, other than simply observing the 81205194Sdelphijpresence or the absence of some byte values, the algorithm produces 82205194Sdelphijconsistent results, regardless what alphabet encoding is being used. 83205194Sdelphij(If counting were involved, it could be possible to obtain different 84205194Sdelphijresults on a text encoded, say, using ISO-8859-16 versus UTF-8.) 85205194Sdelphij 86205194SdelphijThere is an extra category of plain text files that are "polluted" with 87205194Sdelphijone or more black-listed codes, either by mistake or by peculiar design 88205194Sdelphijconsiderations. In such cases, a scheme that tolerates a small fraction 89205194Sdelphijof black-listed codes would provide an increased recall (i.e. more true 90205194Sdelphijpositives). This, however, incurs a reduced precision overall, since 91205194Sdelphijfalse positives are more likely to appear in binary files that contain 92205194Sdelphijlarge chunks of textual data. Furthermore, "polluted" plain text should 93205194Sdelphijbe regarded as binary by general-purpose text detection schemes, because 94205194Sdelphijgeneral-purpose text processing algorithms might not be applicable. 95205194SdelphijUnder this premise, it is safe to say that our detection method provides 96205194Sdelphija near-100% recall. 97205194Sdelphij 98205194SdelphijExperiments have been run on many files coming from various platforms 99205194Sdelphijand applications. We tried plain text files, system logs, source code, 100205194Sdelphijformatted office documents, compiled object code, etc. The results 101205194Sdelphijconfirm the optimistic assumptions about the capabilities of this 102205194Sdelphijalgorithm. 103205194Sdelphij 104205194Sdelphij 105205194Sdelphij-- 106205194SdelphijCosmin Truta 107205194SdelphijLast updated: 2006-May-28 108