1Technical Notes about PCRE 2-------------------------- 3 4These are very rough technical notes that record potentially useful information 5about PCRE internals. 6 7Historical note 1 8----------------- 9 10Many years ago I implemented some regular expression functions to an algorithm 11suggested by Martin Richards. These were not Unix-like in form, and were quite 12restricted in what they could do by comparison with Perl. The interesting part 13about the algorithm was that the amount of space required to hold the compiled 14form of an expression was known in advance. The code to apply an expression did 15not operate by backtracking, as the original Henry Spencer code and current 16Perl code does, but instead checked all possibilities simultaneously by keeping 17a list of current states and checking all of them as it advanced through the 18subject string. In the terminology of Jeffrey Friedl's book, it was a "DFA 19algorithm", though it was not a traditional Finite State Machine (FSM). When 20the pattern was all used up, all remaining states were possible matches, and 21the one matching the longest subset of the subject string was chosen. This did 22not necessarily maximize the individual wild portions of the pattern, as is 23expected in Unix and Perl-style regular expressions. 24 25Historical note 2 26----------------- 27 28By contrast, the code originally written by Henry Spencer (which was 29subsequently heavily modified for Perl) compiles the expression twice: once in 30a dummy mode in order to find out how much store will be needed, and then for 31real. (The Perl version probably doesn't do this any more; I'm talking about 32the original library.) The execution function operates by backtracking and 33maximizing (or, optionally, minimizing in Perl) the amount of the subject that 34matches individual wild portions of the pattern. This is an "NFA algorithm" in 35Friedl's terminology. 36 37OK, here's the real stuff 38------------------------- 39 40For the set of functions that form the "basic" PCRE library (which are 41unrelated to those mentioned above), I tried at first to invent an algorithm 42that used an amount of store bounded by a multiple of the number of characters 43in the pattern, to save on compiling time. However, because of the greater 44complexity in Perl regular expressions, I couldn't do this. In any case, a 45first pass through the pattern is helpful for other reasons. 46 47Computing the memory requirement: how it was 48-------------------------------------------- 49 50Up to and including release 6.7, PCRE worked by running a very degenerate first 51pass to calculate a maximum store size, and then a second pass to do the real 52compile - which might use a bit less than the predicted amount of memory. The 53idea was that this would turn out faster than the Henry Spencer code because 54the first pass is degenerate and the second pass can just store stuff straight 55into the vector, which it knows is big enough. 56 57Computing the memory requirement: how it is 58------------------------------------------- 59 60By the time I was working on a potential 6.8 release, the degenerate first pass 61had become very complicated and hard to maintain. Indeed one of the early 62things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then 63I had a flash of inspiration as to how I could run the real compile function in 64a "fake" mode that enables it to compute how much memory it would need, while 65actually only ever using a few hundred bytes of working memory, and without too 66many tests of the mode that might slow it down. So I re-factored the compiling 67functions to work this way. This got rid of about 600 lines of source. It 68should make future maintenance and development easier. As this was such a major 69change, I never released 6.8, instead upping the number to 7.0 (other quite 70major changes were also present in the 7.0 release). 71 72A side effect of this work was that the previous limit of 200 on the nesting 73depth of parentheses was removed. However, there is a downside: pcre_compile() 74runs more slowly than before (30% or more, depending on the pattern) because it 75is doing a full analysis of the pattern. My hope was that this would not be a 76big issue, and in the event, nobody has commented on it. 77 78Traditional matching function 79----------------------------- 80 81The "traditional", and original, matching function is called pcre_exec(), and 82it implements an NFA algorithm, similar to the original Henry Spencer algorithm 83and the way that Perl works. This is not surprising, since it is intended to be 84as compatible with Perl as possible. This is the function most users of PCRE 85will use most of the time. 86 87Supplementary matching function 88------------------------------- 89 90From PCRE 6.0, there is also a supplementary matching function called 91pcre_dfa_exec(). This implements a DFA matching algorithm that searches 92simultaneously for all possible matches that start at one point in the subject 93string. (Going back to my roots: see Historical Note 1 above.) This function 94intreprets the same compiled pattern data as pcre_exec(); however, not all the 95facilities are available, and those that are do not always work in quite the 96same way. See the user documentation for details. 97 98The algorithm that is used for pcre_dfa_exec() is not a traditional FSM, 99because it may have a number of states active at one time. More work would be 100needed at compile time to produce a traditional FSM where only one state is 101ever active at once. I believe some other regex matchers work this way. 102 103 104Format of compiled patterns 105--------------------------- 106 107The compiled form of a pattern is a vector of bytes, containing items of 108variable length. The first byte in an item is an opcode, and the length of the 109item is either implicit in the opcode or contained in the data bytes that 110follow it. 111 112In many cases below LINK_SIZE data values are specified for offsets within the 113compiled pattern. The default value for LINK_SIZE is 2, but PCRE can be 114compiled to use 3-byte or 4-byte values for these offsets (impairing the 115performance). This is necessary only when patterns whose compiled length is 116greater than 64K are going to be processed. In this description, we assume the 117"normal" compilation options. Data values that are counts (e.g. for 118quantifiers) are always just two bytes long. 119 120A list of the opcodes follows: 121 122 123Opcodes with no following data 124------------------------------ 125 126These items are all just one byte long 127 128 OP_END end of pattern 129 OP_ANY match any one character other than newline 130 OP_ALLANY match any one character, including newline 131 OP_ANYBYTE match any single byte, even in UTF-8 mode 132 OP_SOD match start of data: \A 133 OP_SOM, start of match (subject + offset): \G 134 OP_SET_SOM, set start of match (\K) 135 OP_CIRC ^ (start of data, or after \n in multiline) 136 OP_NOT_WORD_BOUNDARY \W 137 OP_WORD_BOUNDARY \w 138 OP_NOT_DIGIT \D 139 OP_DIGIT \d 140 OP_NOT_HSPACE \H 141 OP_HSPACE \h 142 OP_NOT_WHITESPACE \S 143 OP_WHITESPACE \s 144 OP_NOT_VSPACE \V 145 OP_VSPACE \v 146 OP_NOT_WORDCHAR \W 147 OP_WORDCHAR \w 148 OP_EODN match end of data or \n at end: \Z 149 OP_EOD match end of data: \z 150 OP_DOLL $ (end of data, or before \n in multiline) 151 OP_EXTUNI match an extended Unicode character 152 OP_ANYNL match any Unicode newline sequence 153 154 OP_ACCEPT ) These are Perl 5.10's "backtracking 155 OP_COMMIT ) control verbs". If OP_ACCEPT is inside 156 OP_FAIL ) capturing parentheses, it may be preceded 157 OP_PRUNE ) by one or more OP_CLOSE, followed by a 2-byte 158 OP_SKIP ) number, indicating which parentheses must be 159 OP_THEN ) closed. 160 161 162Repeating single characters 163--------------------------- 164 165The common repeats (*, +, ?) when applied to a single character use the 166following opcodes: 167 168 OP_STAR 169 OP_MINSTAR 170 OP_POSSTAR 171 OP_PLUS 172 OP_MINPLUS 173 OP_POSPLUS 174 OP_QUERY 175 OP_MINQUERY 176 OP_POSQUERY 177 178In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable. 179Those with "MIN" in their name are the minimizing versions. Those with "POS" in 180their names are possessive versions. Each is followed by the character that is 181to be repeated. Other repeats make use of 182 183 OP_UPTO 184 OP_MINUPTO 185 OP_POSUPTO 186 OP_EXACT 187 188which are followed by a two-byte count (most significant first) and the 189repeated character. OP_UPTO matches from 0 to the given number. A repeat with a 190non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an 191OP_UPTO (or OP_MINUPTO or OPT_POSUPTO). 192 193 194Repeating character types 195------------------------- 196 197Repeats of things like \d are done exactly as for single characters, except 198that instead of a character, the opcode for the type is stored in the data 199byte. The opcodes are: 200 201 OP_TYPESTAR 202 OP_TYPEMINSTAR 203 OP_TYPEPOSSTAR 204 OP_TYPEPLUS 205 OP_TYPEMINPLUS 206 OP_TYPEPOSPLUS 207 OP_TYPEQUERY 208 OP_TYPEMINQUERY 209 OP_TYPEPOSQUERY 210 OP_TYPEUPTO 211 OP_TYPEMINUPTO 212 OP_TYPEPOSUPTO 213 OP_TYPEEXACT 214 215 216Match by Unicode property 217------------------------- 218 219OP_PROP and OP_NOTPROP are used for positive and negative matches of a 220character by testing its Unicode property (the \p and \P escape sequences). 221Each is followed by two bytes that encode the desired property as a type and a 222value. 223 224Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by 225three bytes: OP_PROP or OP_NOTPROP and then the desired property type and 226value. 227 228 229Matching literal characters 230--------------------------- 231 232The OP_CHAR opcode is followed by a single character that is to be matched 233casefully. For caseless matching, OP_CHARNC is used. In UTF-8 mode, the 234character may be more than one byte long. (Earlier versions of PCRE used 235multi-character strings, but this was changed to allow some new features to be 236added.) 237 238 239Character classes 240----------------- 241 242If there is only one character, OP_CHAR or OP_CHARNC is used for a positive 243class, and OP_NOT for a negative one (that is, for something like [^a]). 244However, in UTF-8 mode, the use of OP_NOT applies only to characters with 245values < 128, because OP_NOT is confined to single bytes. 246 247Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated, 248negated, single-character class. The normal ones (OP_STAR etc.) are used for a 249repeated positive single-character class. 250 251When there's more than one character in a class and all the characters are less 252than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative 253one. In either case, the opcode is followed by a 32-byte bit map containing a 1 254bit for every character that is acceptable. The bits are counted from the least 255significant end of each byte. 256 257The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode, 258subject characters with values greater than 256 can be handled correctly. For 259OP_CLASS they don't match, whereas for OP_NCLASS they do. 260 261For classes containing characters with values > 255, OP_XCLASS is used. It 262optionally uses a bit map (if any characters lie within it), followed by a list 263of pairs and single characters. There is a flag character than indicates 264whether it's a positive or a negative class. 265 266 267Back references 268--------------- 269 270OP_REF is followed by two bytes containing the reference number. 271 272 273Repeating character classes and back references 274----------------------------------------------- 275 276Single-character classes are handled specially (see above). This section 277applies to OP_CLASS and OP_REF. In both cases, the repeat information follows 278the base item. The matching code looks at the following opcode to see if it is 279one of 280 281 OP_CRSTAR 282 OP_CRMINSTAR 283 OP_CRPLUS 284 OP_CRMINPLUS 285 OP_CRQUERY 286 OP_CRMINQUERY 287 OP_CRRANGE 288 OP_CRMINRANGE 289 290All but the last two are just single-byte items. The others are followed by 291four bytes of data, comprising the minimum and maximum repeat counts. There are 292no special possessive opcodes for these repeats; a possessive repeat is 293compiled into an atomic group. 294 295 296Brackets and alternation 297------------------------ 298 299A pair of non-capturing (round) brackets is wrapped round each expression at 300compile time, so alternation always happens in the context of brackets. 301 302[Note for North Americans: "bracket" to some English speakers, including 303myself, can be round, square, curly, or pointy. Hence this usage.] 304 305Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99 306capturing brackets and it used a different opcode for each one. From release 3073.5, the limit was removed by putting the bracket number into the data for 308higher-numbered brackets. From release 7.0 all capturing brackets are handled 309this way, using the single opcode OP_CBRA. 310 311A bracket opcode is followed by LINK_SIZE bytes which give the offset to the 312next alternative OP_ALT or, if there aren't any branches, to the matching 313OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to 314the next one, or to the OP_KET opcode. For capturing brackets, the bracket 315number immediately follows the offset, always as a 2-byte item. 316 317OP_KET is used for subpatterns that do not repeat indefinitely, while 318OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or 319maximally respectively. All three are followed by LINK_SIZE bytes giving (as a 320positive number) the offset back to the matching bracket opcode. 321 322If a subpattern is quantified such that it is permitted to match zero times, it 323is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are 324single-byte opcodes that tell the matcher that skipping the following 325subpattern entirely is a valid branch. In the case of the first two, not 326skipping the pattern is also valid (greedy and non-greedy). The third is used 327when a pattern has the quantifier {0,0}. It cannot be entirely discarded, 328because it may be called as a subroutine from elsewhere in the regex. 329 330A subpattern with an indefinite maximum repetition is replicated in the 331compiled data its minimum number of times (or once with OP_BRAZERO if the 332minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX 333as appropriate. 334 335A subpattern with a bounded maximum repetition is replicated in a nested 336fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO 337before each replication after the minimum, so that, for example, (abc){2,5} is 338compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group 339has the same number. 340 341When a repeated subpattern has an unbounded upper limit, it is checked to see 342whether it could match an empty string. If this is the case, the opcode in the 343final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher 344that it needs to check for matching an empty string when it hits OP_KETRMIN or 345OP_KETRMAX, and if so, to break the loop. 346 347 348Assertions 349---------- 350 351Forward assertions are just like other subpatterns, but starting with one of 352the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes 353OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion 354is OP_REVERSE, followed by a two byte count of the number of characters to move 355back the pointer in the subject string. When operating in UTF-8 mode, the count 356is a character count rather than a byte count. A separate count is present in 357each alternative of a lookbehind assertion, allowing them to have different 358fixed lengths. 359 360 361Once-only (atomic) subpatterns 362------------------------------ 363 364These are also just like other subpatterns, but they start with the opcode 365OP_ONCE. The check for matching an empty string in an unbounded repeat is 366handled entirely at runtime, so there is just this one opcode. 367 368 369Conditional subpatterns 370----------------------- 371 372These are like other subpatterns, but they start with the opcode OP_COND, or 373OP_SCOND for one that might match an empty string in an unbounded repeat. If 374the condition is a back reference, this is stored at the start of the 375subpattern using the opcode OP_CREF followed by two bytes containing the 376reference number. OP_NCREF is used instead if the reference was generated by 377name (so that the runtime code knows to check for duplicate names). 378 379If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of 380group x" (coded as "(?(Rx)"), the group number is stored at the start of the 381subpattern using the opcode OP_RREF or OP_NRREF (cf OP_NCREF), and a value of 382zero for "the whole pattern". For a DEFINE condition, just the single byte 383OP_DEF is used (it has no associated data). Otherwise, a conditional subpattern 384always starts with one of the assertions. 385 386 387Recursion 388--------- 389 390Recursion either matches the current regex, or some subexpression. The opcode 391OP_RECURSE is followed by an value which is the offset to the starting bracket 392from the start of the whole pattern. From release 6.5, OP_RECURSE is 393automatically wrapped inside OP_ONCE brackets (because otherwise some patterns 394broke it). OP_RECURSE is also used for "subroutine" calls, even though they 395are not strictly a recursion. 396 397 398Callout 399------- 400 401OP_CALLOUT is followed by one byte of data that holds a callout number in the 402range 0 to 254 for manual callouts, or 255 for an automatic callout. In both 403cases there follows a two-byte value giving the offset in the pattern to the 404start of the following item, and another two-byte item giving the length of the 405next item. 406 407 408Changing options 409---------------- 410 411If any of the /i, /m, or /s options are changed within a pattern, an OP_OPT 412opcode is compiled, followed by one byte containing the new settings of these 413flags. If there are several alternatives, there is an occurrence of OP_OPT at 414the start of all those following the first options change, to set appropriate 415options for the start of the alternative. Immediately after the end of the 416group there is another such item to reset the flags to their previous values. A 417change of flag right at the very start of the pattern can be handled entirely 418at compile time, and so does not cause anything to be put into the compiled 419data. 420 421Philip Hazel 422October 2009 423