Extended Regular Expressions

General Information

Extended regular expressions (ERE-s) are used in several places in the LDM system -- primarily as a pattern for matching

Consequently, it's important to know how to use ERE-s.

See the official description of ERE-s.

There is an excellent tool for determining a matching ERE from example text.

Here is an incomplete summary of ERE syntax:

      Text:
        .           Any single character
        [chars]     Character class: One  of chars
        [^chars]    Character class: None of chars
        text1|text2 Alternative: text1 or text2

      Quantifiers:
        ?           0 or 1 of the preceding text
        *           0 or N of the preceding text (N > 0)
        +           1 or N of the preceding text (N > 1)
        {M,N}       M through N of the preceding text (either may be missing)

      Grouping:
        (text)      Grouping of text
                 (either to set the borders of an alternative or
                 for making backreferences where the Nth group can
                 be used in a pqact PIPE action with (for example) \N)

      Anchors:
        ^           Start of string anchor
        $           End   of string anchor

      Escaping:
        \char       escape that particular character
                 (for instance to specify the characters ".[]()"
                 etc.)
    

Pathological ERE-s

Don't use ERE-s with a ".*" prefix because:

  1. They can take up to three orders of magnitude longer to match against a string than their unprefixed counterparts; and
  2. The ".*" prefix adds nothing to the ERE: the same set of strings is matched if it's omitted.

This only applies to ERE-s with a ".*" prefix: the ERE ".*", by itself, is perfectly OK.

The inefficiency of pathological ERE-s can be seen by using the UNIX time utility with the LDM's regex utility. First the non-pathological case:

      $ time regex -n 10000 \
        -s 'lksjdfklsdjfkljsdfkljsdljfsdlkjfdlskjfldjflkjsdflkjsdflkjsd' \
        "some-sort-of-pattern"
      no match

      real	0m0.044s
      user	0m0.040s
      sys	0m0.000s
    

The above indicates that ten-thousand comparisons of the given string against the ERE took 0.04 seconds of user-time.

The timing of the corresponding pathological ERE is much different:

      $ time regex -n 10000 \
        -s 'lksjdfklsdjfkljsdfkljsdljfsdlkjfdlskjfldjflkjsdflkjsdflkjsd' \
        ".*some-sort-of-pattern"
      no match

      real	0m18.424s
      user	0m17.720s
      sys	0m0.020s
    

The above took 17.72 seconds of user-time, meaning that the non-pathological ERE is 443 times more efficient than the pathological one. More complex pathological ERE-s have even worse results.