
Re: [njtbx-users] Problem accessing NcML file via OpenDAP



More answers...

On 27/05/2010 11:31 PM, John Caron wrote:
Giles Lesser wrote:
Hi John / all

Good questions. Answers follow:

On 27/05/2010 10:28 AM, John Caron wrote:

what will the queries be?
The most common queries will be for data in date ranges, e.g. the
"latest X hours/days" or "all data available spanning the date range x
to y". Other, more interesting queries are possible for forecasts (as
per the FMRC aggregation), such as "all forecasts that have been
produced for a certain (valid) time" or "all X-hour forecasts for
analysis times between X and Y".

What will the queries be that need to be fast?
The common queries described above. The forecast queries are not so
speed-sensitive.

how fast is fast?
A fraction of a second - say on the order of 0.1 seconds.
how big are your files and how many in an aggregation?
For forecasts, we are storing one analysisTime per file. For each analysis time there are several forecast times for each of several stations. We have only just started creating .nc files of these, but 1D spectra forecasts are around 25 KB and 2D spectra about 400 KB.

How many files are in an aggregation depends entirely on the strategy we implement - that is part of the question. So far we are simply aggregating all the separate forecast files, i.e. 2 per day times the duration of the dataset (several years). Clearly we don't need to aggregate that many files just to get the latest forecast, or just the latest few days' worth.

For measurements we append new records to the same .nc file until we roll over to a new file every day/week/month/year (we haven't figured out the optimal roll period yet). We then need to aggregate the separate files whenever we want to return data spanning more than one roll period.
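For concreteness, appending a record along the unlimited dimension might look like the minimal sketch below, using the netCDF4-python library; the file and variable names are hypothetical.

import netCDF4

# Open the current roll file for appending; "time" is the unlimited dimension.
ds = netCDF4.Dataset("measurements_201005.nc", "a")
time = ds.variables["time"]    # coordinate variable (hypothetical name)
hs = ds.variables["Hs"]        # measured quantity (hypothetical name)

# Writing one slot past the current end extends the record dimension.
n = len(time)
time[n] = 643766400            # new timestamp
hs[n] = 2.37                   # new measurement value
ds.close()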



In what format do you want the data to be returned?

I'm not sure that I understand this. We will ultimately use the data in
Matlab, Python, or C#.NET dataset objects on the client(s).

Do you need to use opendap on the client?
Not necessarily, but it did seem a convenient way to access datasets on remote servers (in terms of permissions, security, firewalls, etc.).

 Are you using the netcdf-java
library in Matlab?
Yes
 Are there other custom clients?
Not yet.


2. It is essential that we can re-create the exact state of the datasets at specific times - for re-running queries at a later date in case questions arise. This makes me wary of the caching built into TDS - unless the "refresh every" time is set very small, in which case what is the point...?

If data has not yet arrived, you will get different results. How does
that fit into the need for reproducibility?

"not (yet?) arrived"?
All our data is also being given an "insertedTime" record in the .nc
file. Ie the queries described above are actually be a little more
complex. They are actually: "latest X hours/days of data with an
insertedTime<=Z" or "all data available spanning the date range x to y
with an insertedTime<=Z". Ditto for forecasts. By default Z=Inf and
all data in the file are returned.

We do this "filtering" of the data in our NetCDF driver, i.e. the
server is just asked for all the data as per the simple query, and the
removal of data which doesn't satisfy the insertedTime criterion is
done in the driver.
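As a minimal sketch of that driver-side filtering (numpy is assumed, and the variable names are hypothetical):

import numpy as np

def filter_by_inserted_time(values, inserted_time, z=np.inf):
    # Keep only records whose insertedTime is <= Z (Z = Inf keeps everything).
    mask = inserted_time <= z
    return values[mask]

# Reproduce the dataset state as it stood at insert time Z, e.g.:
# snapshot = filter_by_inserted_time(data, inserted, z=643766400)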

So the only thing you need from the server is to do time subsetting,
which can be efficiently done simply by specifying an index range?
Yes - well, two queries seem to be required for that: first get the time values for each index, find the indices corresponding to the required time range, and then query for the data at the identified indices.
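In netCDF4-python over OPeNDAP, that two-step pattern might look like the sketch below (the URL and variable names are hypothetical, and a library build with OPeNDAP support is assumed):

import numpy as np
import netCDF4

# Step 1: fetch only the coordinate variable - a small transfer.
url = "http://server.example/thredds/dodsC/forecasts/spec1d.ncml"
ds = netCDF4.Dataset(url)
t = ds.variables["analysisTime"][:]

# Find the index range covering the requested window [t0, t1];
# assumes t is sorted in ascending order.
t0, t1 = 643507200, 643680000
i0 = np.searchsorted(t, t0, side="left")
i1 = np.searchsorted(t, t1, side="right")

# Step 2: request just that slab of the data variable.
spectra = ds.variables["spec1d"][i0:i1, ...]
ds.close()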



3. I don't like the idea of having to restart the TDS every time a dataset definition is updated in the catalog.xml (it would need to be restarted very frequently).

Not sure what
you mean by the dataset definition? Why is it getting updated?

My poor terminology. I meant that if the NcML aggregation were embedded
in the catalog.xml file (and included a specific reference to each file
in the aggregation, with the coordinate values specified) then the
catalog.xml file would need to be updated each time an extra file is
added to the aggregation. I realise that this isn't necessary if the
NcML aggregations were scans, but then you get into the whole "rescan
every" and caching saga - which (I perceive) complicates satisfying the
reproducibility requirement.

well, my first thought is that the reproducibility problem comes from
having a dataset that is changing.
Indeed.
It's a difficult problem to balance
that against "fast access", which requires caching. NetCDF-3 files which
are being appended to along the unlimited dimension can be made to work.
However, there's no standard way to communicate back to the opendap
client that the index has changed.
I am pretty sure you can't solve this problem by modifying the catalogs
with explicit file references; you don't gain anything over the ncml scan.
Maybe I'm not understanding the way the caching works. As I understand it, if an ncml scan has a "recheckEvery" attribute and the recheckEvery time has passed since the last read of the dataset, then the scan is repeated: each file matching the wildcard is opened and the data in the aggregation dimension is extracted. It appears that a dataset cache is then built in the /cacheAged folder on the TDS server.

These dataset caches LOOK very like the type of NcML aggregation where the "coordValue" attribute is specified in the NcML file. These are the types of NcML files we are constructing externally (with our python scripts), i.e. we are explicitly specifying all the individual files AND the corresponding coordValue(s) in our NcML aggregation files.

I was/am under the impression that this type of aggregation is a lot more efficient to read/construct than the simple scan aggregation, as each individual file DOES NOT need to be opened to construct the aggregation dimension.
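By way of illustration, a generator script for that kind of explicit aggregation could look roughly like the following (a hypothetical sketch - the epoch, filename pattern, and paths are made up), deriving each coordValue from the file name so that no file has to be opened:

import glob
import os
from datetime import datetime

EPOCH = datetime(1990, 1, 1)   # hypothetical epoch for the coordValue seconds

lines = ['<?xml version="1.0" encoding="utf-8"?>',
         '<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">',
         '<aggregation dimName="analysisTime" type="joinExisting">']

for path in sorted(glob.glob(r"\2010\spec1d_*.nc")):
    stamp = os.path.basename(path)[7:-3]           # e.g. "2010052400"
    t = datetime.strptime(stamp, "%Y%m%d%H")
    coord = int((t - EPOCH).total_seconds())       # analysisTime coordinate
    lines.append('<netcdf coordValue="%d" location="%s"/>' % (coord, path))

lines += ['</aggregation>', '</netcdf>']
open("spec1d.ncml", "w").write("\n".join(lines))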

Am I making any sense?




what issues do you have with the SQL database?

I think our overarching problem is that the relational data model in
SQL just doesn't fit our data types in a natural way. Sure, you can
represent pretty much anything in a relational database, but for the
array-based data we use, the abstraction can become messier than we
would like.

perhaps I could get a sample netcdf file that you anticipate using?
I have attached two sample (1D spectrum) forecast files to this email, along with an example of the kind of ncml aggregation file we are (automatically) constructing.

Many thanks


Giles



We also have issues related to:
- licencing: deploying databases to client site locations
- database size: it can become very large and may exceed the sizes permitted within licences
- removal and archiving of data, which is painful/slow
- concurrent access: we have had problems with concurrent access to databases in the past
- portability: it is more difficult to move/copy/backup databases than simple files
Finally, we are attracted by the promise of simpler remote data access using existing clients in the case of OpenDAP.



Giles


______________________________________________________________________________

Giles Lesser, PhD | Research and Development Manager & Senior Coastal Engineer

OMC-International | 6 Paterson St | Abbotsford, VIC 3067

Melbourne | Australia

Phone +61 (3) 9412 6501 | Fax +61 (3) 9415 9105

http://www.omc-international.com.au

Dedicated to safer and more efficient shipping.


Attachment: spec1d_2009010800.nc
Description: Binary data

Attachment: spec1d_2009010812.nc
Description: Binary data

<?xml version="1.0" encoding="utf-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
<aggregation dimName="analysisTime" type="joinExisting">     
<netcdf coordValue="643507200" location="\2010\spec1d_2010052400.nc"/>
<netcdf coordValue="643550400" location="\2010\spec1d_2010052412.nc"/>
<netcdf coordValue="643593600" location="\2010\spec1d_2010052500.nc"/>
<netcdf coordValue="643636800" location="\2010\spec1d_2010052512.nc"/>
<netcdf coordValue="643680000" location="\2010\spec1d_2010052600.nc"/>
<netcdf coordValue="643723200" location="\2010\spec1d_2010052612.nc"/>
<netcdf coordValue="643766400" location="\2010\spec1d_2010052700.nc"/>
<netcdf coordValue="643809600" location="\2010\spec1d_2010052712.nc"/>
<netcdf coordValue="643852800" location="\2010\spec1d_2010052800.nc"/>
<netcdf coordValue="643896000" location="\2010\spec1d_2010052812.nc"/>
<netcdf coordValue="643939200" location="\2010\spec1d_2010052900.nc"/>
<netcdf coordValue="643982400" location="\2010\spec1d_2010052912.nc"/>
<netcdf coordValue="644025600" location="\2010\spec1d_2010053000.nc"/>
<netcdf coordValue="644068800" location="\2010\spec1d_2010053012.nc"/>
<netcdf coordValue="644112000" location="\2010\spec1d_2010053100.nc"/>
</aggregation>  
</netcdf>