
[LDM #RJS-786355]: Regular expressions


> I've just borrowed from a sample pqact.conf file for a GEMPAK
> installation (provided by Tom Yoksas) a pattern for action in my
> pqact.conf file. Rather than trigger a decoder (a la virtually every
> action in the sample file), though, I'm just trying to file the data.
> A regular expression issue comes up. Here's a simplified example that
> I hope illustrates my conceptual problem:
> WMO   (^a)|(^b|(c|d)) .... ([0-3][0-9])([0-2][0-9])..
> FILE (\?:yy)(\?:mm)\?\(?+1)_type.wmo
> where "?" represents an integer that matches the parenthetical
> expression ([0-3][0-9]) (the day of the month) and "?+1" matches the
> next parenthetical expression, ([0-2][0-9]). The letters a, b, c, and
> d represent strings of one or more regular expressions without
> parentheses.
> The question is, what should "?" be?
> I have two conceptual uncertainties here. First, when two
> parenthetical expressions are separated by "|", are the two referred
> to by separate (sequential) values of \n (where n is an integer), or
> are they both referred to by the same value of \n (since they
> represent possibly mutually exclusive alternatives)?

Two parenthetical subexpressions separated by "|" would have two different
\n backreferences.
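
As a rough illustration, the sketch below uses Python's re module as a stand-in for the ERE matcher (an assumption; pqact does not use Python, and the pattern and input are hypothetical). Each alternative gets its own number, and the alternative that did not take part in the match is simply left unset.

```python
import re

# Hypothetical stand-in pattern: two alternatives, each in its own parentheses.
pattern = re.compile(r"(ab)|(cd)")

m = pattern.search("cd")

# Group 1 belongs to the first alternative, group 2 to the second.
first, second = m.group(1), m.group(2)
print("group 1:", first)
print("group 2:", second)
```

Here only the second alternative matches, so group 2 holds the text and group 1 is empty (None in Python).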

> Second, when parentheses are nested, how should the expressions they
> enclose be counted when determining an appropriate value of \n?

Backreference \n always refers to the subexpression enclosed by the n-th
unescaped left parenthesis.
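
The counting rule can be sketched as a small scanner. This is an illustrative Python helper (the name numbered_groups is hypothetical, and it is not part of the LDM): it walks the pattern, skips escaped characters and bracket expressions, and numbers each subexpression by the position of its left parenthesis. It is applied to the simplified example from the question.

```python
import re

def numbered_groups(pattern):
    """List each parenthesized subexpression, ordered by the position of
    its unescaped left parenthesis (list index 0 is subexpression 1)."""
    open_stack = []   # (number, start index) of currently open groups
    spans = {}        # number -> text of the subexpression
    count = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":            # escaped character: not a grouping paren
            i += 2
            continue
        if ch == "[":             # bracket expression: parens inside are literal
            j = i + 1
            if j < len(pattern) and pattern[j] == "^":
                j += 1
            if j < len(pattern) and pattern[j] == "]":
                j += 1            # a leading ']' is a literal member
            while j < len(pattern) and pattern[j] != "]":
                j += 1
            i = j + 1
            continue
        if ch == "(":
            count += 1
            open_stack.append((count, i))
        elif ch == ")":
            number, start = open_stack.pop()
            spans[number] = pattern[start:i + 1]
        i += 1
    return [spans[n] for n in sorted(spans)]

simplified = r"(^a)|(^b|(c|d)) .... ([0-3][0-9])([0-2][0-9]).."
groups = numbered_groups(simplified)
for n, text in enumerate(groups, start=1):
    print(n, text)

# Cross-check the count against Python's own group numbering.
assert re.compile(simplified).groups == len(groups)
```

For the simplified example this lists five subexpressions, with the nested (c|d) as number 3, so the day and hour are \4 and \5. Applying the same count to the longer ship/buoy/CMAN pattern (with its wrapped lines rejoined) gives nine subexpressions, with the day and hour as the eighth and ninth.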

> In the example above, "?" could be anywhere from 2 to 4, depending on
> the answers to these questions, and in one instance the number could
> vary depending on which option of the highest-level "|" ("or")
> structure in the example above is realized.
> The actual pattern that I'm working with is supposed to capture ship,
> buoy, and CMAN data and looks like this:
> WMO  (^S[IMN]V[^GINS])|^S[IMN]W[^KZ]|(^S(HV|HXX|S[^X]))|(^SX(VD|V.50|US(2[0-3]|08|40|82|86)))|(^Y[HO]XX84) .... ([0-3][0-9])([0-2][0-9])..
> FILE data/surface/(\n:yy)(\n:mm)\n\m_boy.wmo
> where \n and \m are to be determined to get the date and time when the
> data were recorded.

There are some things wrong with the above extended regular expression.
As I recall, the first field in a WMO header has six characters: four 
letters followed by two digits.  The above ERE, however, would match,
for example, "SIVA ", "SIWA ", "SHV ", "SHXX ", and "SSA " -- which
don't fit the pattern of the first field of a WMO header.
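
The matches listed above can be checked with a short script. This is a sketch that again assumes Python's re module behaves like the ERE matcher for this pattern; the pattern is the one from the message with its wrapped lines rejoined.

```python
import re

# The ship/buoy/CMAN pattern from the message, wrapped lines rejoined.
# Python's re stands in for the POSIX ERE matcher here (an assumption).
ERE = (r"(^S[IMN]V[^GINS])|^S[IMN]W[^KZ]|(^S(HV|HXX|S[^X]))"
       r"|(^SX(VD|V.50|US(2[0-3]|08|40|82|86)))"
       r"|(^Y[HO]XX84) .... ([0-3][0-9])([0-2][0-9])..")

# Strings that do not look like a six-character first field.
malformed = ["SIVA ", "SIWA ", "SHV ", "SHXX ", "SSA "]

results = {}
for text in malformed:
    m = re.search(ERE, text)
    # Record the matched text plus subexpressions 8 and 9 (day and hour).
    results[text] = (m.group(0), m.group(8), m.group(9)) if m else None
    print(repr(text), results[text])
```

Each of the five strings matches through one of the early alternatives. In those matches the day and hour subexpressions take no part, because the " .... ([0-3][0-9])([0-2][0-9]).." tail belongs only to the last alternative.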

To simplify things, you can always break up a complicated ERE into
multiple pqact(1) entries, each one handling a subset of the complicated

Steve Emmerson

Ticket Details
Ticket ID: RJS-786355
Department: Support LDM
Priority: Normal
Status: Closed
