Extended Regular Expressions

General Information

Extended regular expressions (ERE-s) are used in several places in the LDM system -- primarily as a pattern for matching

Consequently, it's important to know how to use ERE-s.

The official description of ERE-s can be found at http://www.opengroup.org/onlinepubs/7908799/xbd/re.html#tag_007_004 .

An excellent tool for determining a matching ERE from example text can be found at http://txt2re.com.

Here is an incomplete summary of ERE syntax:

  .           Any single character
  [chars]     Character class: One  of chars
  [^chars]    Character class: None of chars
  text1|text2 Alternative: text1 or text2

  ?           0 or 1 of the preceding text
  *           0 or N of the preceding text (N > 0)
  +           1 or N of the preceding text (N > 1)
  {M,N}       M through N of the preceding text (either may be missing)

  (text)      Grouping of text
              (either to set the borders of an alternative or
              for making backreferences where the Nth group can 
              be used in a pqact PIPE action with (for example) \N)

  ^           Start of string anchor
  $           End   of string anchor

  \char       escape that particular character
              (for instance to specify the characters ".[]()" 

Pathological ERE-s

Don't use ERE-s with a ".*" prefix because:

  1. They can take up to three orders of magnitude longer to match against a string than their unprefixed counterparts; and
  2. The ".*" prefix adds nothing to the ERE: the same set of strings is matched if it's omitted.
This only applies to ERE-s with a ".*" prefix: the ERE ".*", by itself, is perfectly OK.

The inefficiency of pathological ERE-s can be seen by using the UNIX time utility with the LDM's regex utility. First the non-pathological case:

$ time regex -n 10000 \
    -s 'lksjdfklsdjfkljsdfkljsdljfsdlkjfdlskjfldjflkjsdflkjsdflkjsd' \
no match

real	0m0.044s
user	0m0.040s
sys	0m0.000s
The above indicates that ten-thousand comparisons of the given string against the ERE took 0.04 seconds of user-time.

The timing of the corresponding pathological ERE is much different:

$ time regex -n 10000 \
    -s 'lksjdfklsdjfkljsdfkljsdljfsdlkjfdlskjfldjflkjsdflkjsdflkjsd' \
no match

real	0m18.424s
user	0m17.720s
sys	0m0.020s
The above took 17.72 seconds of user-time, meaning that the non-pathological ERE is 443 times more efficient than the pathological one. More complex pathological ERE-s have even worse results.