Extended Regular Expressions



General Information

Extended regular expressions (ERE-s) are used in several places in the LDM system -- primarily as a pattern for matching

Consequently, it's important to know how to use ERE-s.

The official description of ERE-s can be found at http://www.opengroup.org/onlinepubs/7908799/xbd/re.html#tag_007_004 .

An excellent tool for determining a matching ERE from example text can be found at http://txt2re.com.

Here is an incomplete summary of ERE syntax:

Text:
  .           Any single character
  [chars]     Character class: One  of chars
  [^chars]    Character class: None of chars
  text1|text2 Alternative: text1 or text2

Quantifiers:
  ?           0 or 1 of the preceding text
  *           0 or N of the preceding text (N > 0)
  +           1 or N of the preceding text (N > 1)
  {M,N}       M through N of the preceding text (either may be missing)

Grouping:
  (text)      Grouping of text
              (either to set the borders of an alternative or
              for making backreferences where the Nth group can 
              be used in a pqact PIPE action with (for example) \N)

Anchors:
  ^           Start of string anchor
  $           End   of string anchor

Escaping:
  \char       escape that particular character
              (for instance to specify the characters ".[]()" 
              etc.)

Pathological ERE-s

Don't use ERE-s with a ".*" prefix because:

  1. They can take up to three orders of magnitude longer to match against a string than their unprefixed counterparts; and
  2. The ".*" prefix adds nothing to the ERE: the same set of strings is matched if it's omitted.
This only applies to ERE-s with a ".*" prefix: the ERE ".*", by itself, is perfectly OK.

The inefficiency of pathological ERE-s can be seen by using the UNIX time utility with the LDM's regex utility. First the non-pathological case:

$ time regex -n 10000 \
    -s 'lksjdfklsdjfkljsdfkljsdljfsdlkjfdlskjfldjflkjsdflkjsdflkjsd' \
    "some-sort-of-pattern"
no match

real	0m0.044s
user	0m0.040s
sys	0m0.000s
The above indicates that ten-thousand comparisons of the given string against the ERE took 0.04 seconds of user-time.

The timing of the corresponding pathological ERE is much different:

$ time regex -n 10000 \
    -s 'lksjdfklsdjfkljsdfkljsdljfsdlkjfdlskjfldjflkjsdflkjsdflkjsd' \
    ".*some-sort-of-pattern"
no match

real	0m18.424s
user	0m17.720s
sys	0m0.020s
The above took 17.72 seconds of user-time, meaning that the non-pathological ERE is 443 times more efficient than the pathological one. More complex pathological ERE-s have even worse results.