Unidata - To provide the data services, tools, and cyberinfrastructure leadership that advance Earth system science, enhance educational opportunities, and broaden participation. Unidata
         
  advanced  
 

Extended Regular Expressions



General Information

Extended regular expressions (ERE-s) are used in several places in the LDM system -- primarily as a pattern for matching

Consequently, it's important to know how to use ERE-s.

The official description of ERE-s can be found at http://www.opengroup.org/onlinepubs/7908799/xbd/re.html#tag_007_004 .

Here is an incomplete summary of ERE syntax:

Text:
  .           Any single character
  [chars]     Character class: One  of chars
  [^chars]    Character class: None of chars
  text1|text2 Alternative: text1 or text2

Quantifiers:
  ?           0 or 1 of the preceding text
  *           0 or N of the preceding text (N > 0)
  +           1 or N of the preceding text (N > 1)

Grouping:
  (text)      Grouping of text
              (either to set the borders of an alternative or
              for making backreferences where the Nth group can 
              be used in a pqact PIPE action with (for example) \N)

Anchors:
  ^           Start of string anchor
  $           End   of string anchor

Escaping:
  \char       escape that particular character
              (for instance to specify the characters ".[]()" 
              etc.)

Pathological ERE-s

Don't use ERE-s with a ".*" prefix because:

  1. They can take up to three orders of magnitude longer to match against a string than their unprefixed counterparts; and
  2. The ".*" prefix adds nothing to the ERE: the same set of strings is matched if it's omitted.
This only applies to ERE-s with a ".*" prefix: the ERE ".*", by itself, is perfectly OK.

The inefficiency of pathological ERE-s can be seen by using the UNIX® time utility with the LDM's regex utility. First the non-pathological case:

$ time regex -n 10000 \
    -s 'lksjdfklsdjfkljsdfkljsdljfsdlkjfdlskjfldjflkjsdflkjsdflkjsd' \
    "some-sort-of-pattern"
no match

real	0m0.044s
user	0m0.040s
sys	0m0.000s
The above indicates that ten-thousand comparisons of the given string against the ERE took 0.04 seconds of user-time.

The timing of the corresponding pathological ERE is much different:

$ time regex -n 10000 \
    -s 'lksjdfklsdjfkljsdfkljsdljfsdlkjfdlskjfldjflkjsdflkjsdflkjsd' \
    ".*some-sort-of-pattern"
no match

real	0m18.424s
user	0m17.720s
sys	0m0.020s
The above took 17.72 seconds of user-time, meaning that the non-pathological ERE is 443 times more efficient than the pathological one. More complex pathological ERE-s have even worse results.
 
 
  Contact Us     Site Map     Search     Terms and Conditions     Privacy Policy     Participation Policy
 
National Science Foundation (NSF) UCAR Office of Programs University Corporation for Atmospheric Research (UCAR)   Unidata is a member of the UCAR Office of Programs, is managed by the University Corporation for Atmospheric Research, and is sponsored by the National Science Foundation.
P.O. Box 3000     Boulder, CO 80307-3000 USA     Tel: 303-497-8643     Fax: 303-497-8690