
[TIGGE #EGL-584516]: LDM regexp matching

Hi Baudouin,

Sorry for the delay in responding...

> are spaces handled in the regex?

Yes.  A space is no different from any other character.

> We push products with:
> pqinsert -p "$name $rand" $filename
> where $rand is a random number so NCAR can control the number of stream
> with 20 rules like (there is a space after "grib"):

I should have commented on this earlier.  We think that it would
be better to use a sequence number instead of a random number.
The sequence number could be used to segment the stream into
pieces and as a tracer that demonstrates that every product
inserted into the upstream queue (your machine) is received by
the downstream machine (dataportal).
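
To make the idea concrete, here is one way the counter could be
kept.  This is only a sketch: the counter file location, the
'next_seq' helper name, and the reset threshold are illustrative,
not prescriptive -- any persistent, monotonically increasing
source would do.

```shell
# next_seq FILE -> prints the next sequence number, persisting it in FILE.
# The file name and the reset threshold below are illustrative only.
next_seq() {
  seqfile=$1
  # Seed the counter file on first use.
  [ -f "$seqfile" ] || echo 0 > "$seqfile"
  seq=$(($(cat "$seqfile") + 1))
  # Reset once the counter gets large so the header field stays bounded;
  # when to reset (e.g., per model cycle) is a local policy decision.
  [ "$seq" -ge 1000000 ] && seq=1
  echo "$seq" > "$seqfile"
  echo "$seq"
}

# Usage, substituting the sequence number for $rand in the quoted command:
#   pqinsert -p "$name $(next_seq /var/tmp/tigge.seq)" $filename
```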

This is the approach we took in our CONDUIT datastream, and it
has been quite successful.  The only thing that you would need
to worry about is when to reset the counter.  In CONDUIT, the
products inserted into the queue come from the output of a model
at a particular timestep.  A program written to understand the
model output being produced carves up the output file (which
contains all fields for the timestep) into individual GRIB or
GRIB2 messages (depending on what the model output), records
the products in a manifest file (Manuel asked about this
previously), and then inserts them into the LDM queue using
information about the product (e.g., parameter, level, time,
forecast time, etc.) and a monotonically increasing sequence
number as part of the product header.  The descriptive header
can then be used by downstream machines to select tailored
subsets of the stream.

In the case of TIGGE, we recommend that the downstream machine
(dataportal) subset the streams from you into 10 mutually
exclusive requests to start with, to see if the split is enough
to allow ingestion of all products with minimal latencies.  If
additional splitting is needed, it can be done pretty easily.

> tigge_\([a-z]*\)_\([a-z]*\)_\(\d{8}\)_\(\d{4}\)_\(\d\)_.*.grib
> .*\(10|30|50|70|90\)$
> tigge_\([a-z]*\)_\([a-z]*\)_\(\d{8}\)_\(\d{4}\)_\(\d\)_.*.grib
> .*\(11|31|51|71|91\)$
> ...
> I have been looking at our log files, and the last group of number seam
> to match number anywhere in the product name:
> 'tigge_ecmf_fc_20060331_0000_0_0_potential_temperature_0.grib 11335'

I assume that this is the product metadata.

> -> tigge_\([a-z]*\)_\([a-z]*\)_\(\d{8}\)_\(\d{4}\)_\(\d\)_.*.grib
> .*\(11|31|51|71|91\)$

If this is the regular expression being used to match the header, then
I believe (but must check with our LDM expert) that your syntax
is incorrect.  For instance, I have never seen '\d' used as a
digit class in an LDM pattern; as far as I know, the LDM matches
product IDs against POSIX extended regular expressions, where
digits are written as '[0-9]' and grouping uses plain parentheses.

Also, where is this regular expression being used (e.g., in ldmd.conf
or pqact.conf, or in the '-p' pattern on the command line of the
LDM 'notifyme' utility)?
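
A quick way to sanity-check a candidate pattern is 'grep -E',
which uses POSIX extended regular expressions -- to my knowledge
the same flavor the LDM uses.  The pattern below is my rewrite of
your expression into that flavor, so treat it as a suggestion to
verify rather than a confirmed fix:

```shell
# A sample TIGGE product ID, taken from the log excerpt in this thread.
hdr='tigge_ecmf_fc_20060331_0000_0_0_snow_depth_0.grib 11333'

# POSIX ERE rewrite: '[0-9]' instead of '\d', plain parentheses
# for grouping instead of '\(' '\)'.
pat='tigge_([a-z]*)_([a-z]*)_([0-9]{8})_([0-9]{4})_([0-9])_.*grib'

# grep -E returns success when the pattern matches the header.
echo "$hdr" | grep -E -q "$pat" && echo "pattern matches"
```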

> 'tigge_ecmf_fc_20060331_0000_0_0_snow_depth_0.grib 11333'
> -> tigge_\([a-z]*\)_\([a-z]*\)_\(\d{8}\)_\(\d{4}\)_\(\d\)_.*.grib
> .*\(11|31|51|71|91\)$
> 'tigge_ecmf_fc_20060331_0000_0_0_convective_available_potential_energy_0.grib
> 11350'
> -> tigge_\([a-z]*\)_\([a-z]*\)_\(\d{8}\)_\(\d{4}\)_\(\d\)_.*.grib
> .*\(10|30|50|70|90\)$
> 'tigge_ecmf_fc_20060331_0000_0_0_geopotential_700.grib 11338'
> -> tigge_\([a-z]*\)_\([a-z]*\)_\(\d{8}\)_\(\d{4}\)_\(\d\)_.*.grib
> .*\(10|30|50|70|90\)$
> In the first example, the 11 from 11333 is matched. In the second the 50
> from 11350 is matched, which is the expected behaviour.
> In this last example, is seems that the 70 from the end of the regex
> matches the 700.
> The result is that the same product matched several regex and is sent
> several time to NCAR.
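
The '700' behaviour falls out of regular expression precedence:
alternation binds loosest, so in '.*\(10|30|50|70|90\)$' only the
last branch is anchored by the '$', and a bare branch like '70'
matches anywhere in the header -- including inside '700'.  A short
demonstration, using 'grep -E' as a stand-in for the LDM's matcher
(which I believe uses the same POSIX extended flavor):

```shell
# The header Baudouin reported as a false match.
hdr='tigge_ecmf_fc_20060331_0000_0_0_geopotential_700.grib 11338'

# Alternation binds loosest: '$' anchors only the last branch,
# so the bare branch '70' matches anywhere -- here, inside '700'.
echo "$hdr" | grep -E -q '.*\(10|30|50|70|90\)$' && echo "ungrouped: match"

# Grouping the alternation anchors every branch; '11338' ends in
# '38', so the grouped pattern correctly rejects this header.
echo "$hdr" | grep -E -q '.*(10|30|50|70|90)$' || echo "grouped: no match"
```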

If the objective is to use the regular expression in the 'request' line
on the downstream machine (dataportal), then I suggest that the
process can be made a lot simpler:

- change the random number being appended to the header to a monotonically
  increasing sequence number

- construct the ~ldm/etc/ldmd.conf request lines on the downstream to
  select a subset of the products.  This is exactly what we did during
  the testing phase -- we used the sequence number Manuel had added
  to the products and split the request into something like 10 request
  lines.  For instance, if the monotonically increasing sequence number
  is the last part of the product header, then your request lines could
  look like:

request EXP "0$" tigge-ldm.ecmwf.int
request EXP "1$" tigge-ldm.ecmwf.int
...
request EXP "9$" tigge-ldm.ecmwf.int

  This should effectively split the feed into tenths.
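
As a sanity check (again using 'grep -E' as a stand-in for the
LDM's matcher), a header whose last field is the sequence number
matches exactly one of the ten patterns, so the ten requests are
mutually exclusive and together cover the whole feed:

```shell
# A header whose last field is the sequence number.
hdr='tigge_ecmf_fc_20060331_0000_0_0_snow_depth_0.grib 11333'

matches=0
for d in 0 1 2 3 4 5 6 7 8 9; do
  # Each request-line pattern anchors a single final digit.
  echo "$hdr" | grep -E -q "${d}\$" && matches=$((matches + 1))
done

# The header ends in exactly one digit, so exactly one pattern matches.
echo "header matches $matches of the 10 request patterns"
```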

  NB: it is possible that the feed could be split into some other
  fraction like fifths, twentieths, etc.  We need to experiment to
  find out how small a split is needed to successfully get all
  products with minimal latency.

> Baudouin
> PS: Doug can you change the space in the regex to a colon (:)

If our suggestion about inclusion of the monotonically increasing
sequence number is accepted, then I want to push for a simplification
in the request regular expression being used.  Doug and Dave, we
can take a look at what you have and make recommendations if we
are allowed to login to dataportal as 'ldm'.  It might be the case
that my and Mike Schmidt's logins still work; I haven't checked.
The problem we encountered before was that we could see the setup
and edit configuration files, but we were unable to restart the LDM
since we were not allowed to login as 'ldm'.  This is OK, but it
increases the time needed to effect a change (Dave can attest to
this when he returns from his vacation).


Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
Unidata HomePage                       http://www.unidata.ucar.edu

Ticket Details
Ticket ID: EGL-584516
Department: Support IDD TIGGE
Priority: Normal
Status: Closed