
Re: Seaspace satellite formats



Susan,

>       I got a note from Dave Johnson that Serafin has
>       okayed the satellite receive station. Do you have
>       any knowledge of the Terascan format they use...
>       isn't it some variant of netCDF?

It's TDF, the TeraScan Data Format.  It was developed by Joe Fahle, one of
the early contributors to the netCDF interface design.  TDF and netCDF have
diverged quite a bit since then, with TDF not being constrained by the
necessity for a Fortran interface or use of XDR for storing data portably.
TDF is a more advanced interface in many ways, implementing

  - a way to subset data without copying it and to reference data in other
    files ("assemblies");
  - a way to import ASCII or binary data from files in other formats without
    copying or converting it ("instant import");
  - built-in handling of time;
  - built-in support for georeferencing using rectangular, Mercator, UTM,
    polyconic, oblique stereographic, and polar stereographic projections;
  - support for "relations" (ordered lists of variables) besides dimensions,
    variables, and attributes;
  - support for a "string" data type;
  - built-in (required) attributes (e.g. units, badval, scale, offset, ...).

Here are some comments from Joe in recent correspondence:

> We have an hdf-to-tdf converter (and vice-versa).  Dave Wilensky has
> done a lot of netcdf <-> tdf conversions at LSU.
> 
> Because of the XDR issue, we probably cannot do an 'uninstantiated'
> import of a netcdf dataset, which would be the best solution.
> 
> The converter approach pretty much is the way to go.
> 
> Joe
> 
> PS.  Not having the XDR stuff has kept Terascan from running on PC's.
> I have programmed up a solution that takes care of this without using
> XDR's,  but no questions, XDR's are the way to go for data portability 
> 
> As far as a C++ interface, that wouldn't take more than a day to whip
> that up.  But I'll never go back to Fortran.

I've appended a man-page overview of TDF.

--Russ


DATASETS(7)             TeraScan Overview             DATASETS(7)



NAME
     datasets - TeraScan common data format (TDF)

SYNOPSIS
     lib/libcdf.a

DESCRIPTION
     Introduction

     Each TeraScan dataset is a separate UNIX file  organized  in
     the  TeraScan  common  data  format  (TDF).   The  TDF is an
     extremely versatile file format that is capable  of  assimi-
     lating  a wide variety of data types, shapes and sizes.  For
     example, a single  dataset  could  contain  satellite  image
     data, random in-situ data, and 3-D model data.

     The TDF was developed  during  the  same  period  that  NASA
     developed  the Common Data Format (CDF) [Treinish and Gough,
     1987], and served as a basis for the UNIDATA Network  Common
     Data Format (netCDF) [Rew, 1988].  The TDF has been substan-
     tially upgraded since then.

     Dimensions, variables,  relations  and  attributes  are  the
     basic  dataset  components.  Variables  are simply arrays of
     data; dimensions define the sizes of  these  arrays.   Rela-
     tions  are  ordered  lists  of  variables.   Attributes hold
     information about the dataset as a whole, or  about  indivi-
     dual  variables,  dimensions  or  relations.  Only datasets,
     variables, and  relations  can  currently  have  application
     defined attributes.

     The following datatypes may be used to define variables  and
     attributes:  byte,  short,  long, float, double, and string.
     Codes  and  ranges  for  these  datatypes  are  defined   in
     include/gp.h.   String  is  a variable-width datatype, i.e.,
     the number of bytes required to store one element is  appli-
     cation defined.  Applications can implement a complex-valued
     variable by adding an extra dimension of  length  2  to  the
     variable.

     Normally all dataset definitions and data are  stored  in  a
     single  UNIX  file.   However, a dataset can reference variables
     from several files using links.  Links allow rapid
     import of non-TDF data, and support lightweight dataset subsets
     and assemblies.

     Programming Interface

     TDF access routines are independent of  any  other  TeraScan
     software  component  except  lib/utils.a.   See  dirfile(3),
     misc(3), and terrno(3).  Therefore, TDF applications can  be
     written  without  using  TeraScan  user  interface  or earth



TeraScan              Last change: 1/13/93                      1






     transform facilities.  TDF calls can be embedded in existing
     non-TeraScan applications as desired.

     TDF datatypes, constants, and error status codes are defined
     in include/gp.h.

     Object Pointers

     The basic TDF objects are sets, dimensions,  variables,  and
     relations.   Application-defined  attributes  are  not  con-
     sidered objects, even though they can be  treated  as  such.
     Files are secondary objects, and are of only passing concern
     to applications.

     Pointers to objects (actually object  data  structures)  are
     returned  by search or definition functions.  These pointers
     are used as arguments to other functions.  All data structures
     have magic numbers and alignment criteria, which help
     to identify bogus pointers.  A pointer to an  object's  data
     structure  is  "pinned"  (i.e.,  can never change) until the
     object is no longer available (i.e. the  containing  dataset
     is closed).

     All application accessible data structures exist  in  memory
     that is allocated using UNIX malloc().  malloc is used sparingly
     and in an unfragmented manner, so  as  not  to  impact
     applications which also use (and possibly abuse) malloc.

     Applications cannot be prevented from modifying data  struc-
     tures,  even for datasets opened as readonly. Given this, it
     was decided  to  let  applications  perform  all  operations
     except  variable I/O for readonly datasets, including defin-
     ing new variables, relations, and attributes.

     One obvious disadvantage of  having  application  accessible
     data  structures is that applications will undoubtedly trash
     them more easily than if they were hidden.  All data structure
     components should be considered readonly, unless otherwise
     specified.

     Applications can loop through  a  list  of  similar  objects
     (e.g. all dimensions belonging to a dataset) using

          while (pointer != NULL) pointer = pointer->next;
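
     As a slightly fuller sketch of that traversal, the block below
     walks a list of dimensions to count them and to look one up by
     name.  The structure is a hypothetical stand-in: only the name,
     size, and next fields come from this page, and the real layout
     in include/gp.h is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a TDF dimension record.  The real structure is
 * defined in include/gp.h and has more fields than shown here. */
struct demo_dim {
    char             name[32];
    long             size;
    struct demo_dim *next;   /* next dimension in the same dataset, or NULL */
};

/* Count the objects on a list by following the next pointers. */
int demo_count_dims(const struct demo_dim *first)
{
    int n = 0;
    for (; first != NULL; first = first->next)
        n++;
    return n;
}

/* Return the dimension with the given name, or NULL if none matches. */
struct demo_dim *demo_find_dim(struct demo_dim *first, const char *name)
{
    struct demo_dim *d;
    assert(name != NULL);
    for (d = first; d != NULL; d = d->next) {
        if (strcmp(d->name, name) == 0)
            return d;
    }
    return NULL;
}
```

     The same loop shape applies to any of the lists described on
     this page (variables, relations, attributes), since each object
     carries its own next pointer.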

     Attributes

     Attributes refine the definitions of datasets and their com-
     ponents.  There are two kinds of attributes:

     -   Built-in attributes, i.e., fields in application  acces-
         sible data structures  (See include/gp.h)



     -   Application-defined attributes, created using the define
         or copy attribute functions

     Dimensions and files do not have application-defined attributes.
     The only file attribute of any interest to applications
     is file->path, which is built-in.  Application-defined
     dimension attributes may be added in the future.

     Note that applications are free to change names directly and
     potentially  generate name conflicts within a dataset.  This
     is the least harmful of all the ways applications can damage
     datasets.

     Different objects can have attributes with  the  same  name,
     but with different datatypes or lengths.  This new flexibil-
     ity should be used cautiously; two attributes with different
     meaning should never have the same name.

     The following built-in attributes are intended  for  use  by
     applications;  only  those marked (*) can be set directly by
     applications.

     * dim->name      - dimension name
       dim->unlimited - non-zero if dimension can grow
       dim->size      - current size
     * dim->coord     - dimension coordinate
     * dim->scale     - orig index = index * scale + offset
     * dim->offset

     * var->name      - variable name
     * var->units     - units
       var->type      - datatype
     * var->badval    - missing value as stored on disk
     * var->usemin    - minimum valid stored value
     * var->usemax    - maximum valid stored value
     * var->scale     - true value = scale * stored value + offset
     * var->offset

     * rel->name      - relation name
     * rel->kind      - relation kind (analogous to variable units)

     * att->name      - attribute name
     * att->units     - attribute units
       att->type      - datatype
       att->size      - number of elements in attribute

       file->path     - file path name

     Application defined attributes  are  normally  not  accessed
     like  objects.   Their values are set and retrieved by name,
     rather than by pointer.  Pointers to  attribute  definitions
     are  available  for getting attribute datatype, lengths, and
     units, as well as looping through lists of attributes.

     Application Defined Relationships

     The new abstraction, "relation", has been added to datasets.
     A  relation  consists  of  an  ordered list of variables all
     belonging to the  same  dataset.   Relations  have  built-in
     attributes  "name" and "kind", where relation kind is analo-
     gous  to  variable   units.    Relations   also   can   have
     application-defined attributes.  The number and order of the
     variables associated by a relation, as well as its
     application-defined attributes, are determined by its kind.

     The following is an example of how relations can be used:

     Given a variable "date" that contains  an  ordered  list  of
     dates,  a  variable  "year" that contains an ordered list of
     years, and a variable "year_index" that is defined  as  fol-
     lows:

          year_index[i] = j if k > j => date[k] >= year[i]

     define the relation  "year_index"  of  kind  "sparse_index",
     consisting  of  the  ordered tuple (date, year, year_index).
     (Obviously, "date" and "year" must have the same  units  for
     this to work.)
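
     To make the sparse index concrete, here is a minimal sketch
     under one reading of the definition above: year_index[i] is the
     first position in the sorted "date" array whose value is not
     less than year[i].  The function name and the use of double for
     both arrays are assumptions made for illustration; nothing here
     is part of the TDF interface.

```c
#include <assert.h>
#include <stddef.h>

/* For each year[i], store the first position j in the ascending `date`
 * array with date[j] >= year[i], or ndates if no such position exists.
 * Both arrays must be ascending and expressed in the same units. */
void demo_build_year_index(const double *date, size_t ndates,
                           const double *year, size_t nyears,
                           size_t *year_index)
{
    size_t i, j = 0;
    assert(date != NULL && year != NULL && year_index != NULL);
    for (i = 0; i < nyears; i++) {
        /* year[] is ascending, so j never has to move backwards. */
        while (j < ndates && date[j] < year[i])
            j++;
        year_index[i] = j;
    }
}
```

     With such an index, all dates falling in year i occupy the
     half-open range from year_index[i] to year_index[i+1], which is
     what makes the relation useful for fast lookups.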

     Builtin Relationships

     The following  relationships  are  built-in  to  application
     accessible  data  structures;  only  those marked (*) can be
     changed directly by applications:

       var->dim[], var->ndims           - variable has dimensions
       rel->var[], rel->nvars           - relation relates variables

     * dim->var    - a dimension can get its values from a variable
                     i.e., value corresponding to dim=i is var[i]
       var->file   - a variable's data is stored in a file

       set->natts, set->att, att->next  - dataset has attributes
       var->natts, var->att, att->next  - variable has attributes
       rel->natts, rel->att, att->next  - relation has attributes

       set->ndims, set->dim, dim->next  - a dataset has dimensions
       set->nvars, set->var, var->next  - a dataset has variables
       set->nrels, set->rel, rel->next  - a dataset has relations

       firstset, set->next   - a program has a list of datasets

       dim->owner    - a dimension belongs to a dataset
       var->owner    - a variable belongs to a dataset
       rel->owner    - a relation belongs to a dataset
       att->owner    - an attribute belongs to a dataset, variable,
                       or relation

     Pointers are used to represent all  built-in  relationships.
     Linked  lists  are  used  for all "has" relationships except
     two:  var->dim[] and rel->var[].  In both cases, these asso-
     ciations  are  "many-to-many".  Linked lists are impractical
     due to multi-threading.  Instead,  variable  dimensions  and
     relation variables are stored in arrays. The number of vari-
     able dimensions is limited (e.g., GP_VAR_DIMS =  5).   There
     is no limit on the number of relation variables.

     Some built-in relationships are  circular;  e.g.  var->dim[]
     and  dim->var,  or  set->var  and  var->owner.   Due  to the
     hierarchical nature of declarations  in  C,  some  of  these
     pointers have to be declared of type "char", which is unfor-
     tunate.

     Scaled Variable Data

     In original TeraScan datasets,  information  for  converting
     8-bit  or  16-bit data to real values was stored in applica-
     tion defined scaling attributes.   Now,  scaling  attributes
     are  built-in  to  all  variables,  regardless  of datatype.
     var->scale and var->offset are used to convert  stored  data
     to its true form:

     true value = var->scale * stored value + var->offset

     Note, built-in attributes var->badval, var->usemin and
     var->usemax  all  refer to stored values.  When presenting these
     attributes to users, applications may want to apply  scaling
     to at least var->usemin and var->usemax.

     The most common use of scaling is to store real-valued  data
     with  a  minimum yet appropriate number of significant bits.
     However, scaling can be used to help change  variable  units
     without  changing  actual data; e.g., to change from degrees
     Celsius to degrees Fahrenheit:

           gpputname(var->units, C_FAHRENHEIT);
           var->scale  = var->scale  * 1.8;
           var->offset = var->offset * 1.8 + 32.;

     Another benefit of builtin scaling is that it allows  appli-
     cations  to  pretend  they are working with a single type of
     data: double precision.  Variable read and  write  routines,
     that  respectively  scale  and unscale data, are provided as
     part of the standard interface.  This does not preclude  the
     writing  of  applications  that  treat each type of variable
     differently.
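
     The arithmetic behind scaling can be sketched as follows.  The
     structure and function names are hypothetical; only the scale
     and offset attributes and the formula come from this page.  Note
     that a units change must rescale the existing offset as well as
     the scale, since F = 1.8 * (scale * stored + offset) + 32.

```c
#include <assert.h>

/* Hypothetical subset of the built-in variable attributes named above. */
struct demo_var {
    double scale;
    double offset;
};

/* true value = scale * stored value + offset */
double demo_to_true(const struct demo_var *v, double stored)
{
    return v->scale * stored + v->offset;
}

/* Inverse mapping, as a write routine would need. */
double demo_to_stored(const struct demo_var *v, double true_value)
{
    assert(v->scale != 0.0);
    return (true_value - v->offset) / v->scale;
}

/* Re-express a Celsius variable in Fahrenheit without touching the stored
 * data: both the scale and the old offset are multiplied by 1.8. */
void demo_celsius_to_fahrenheit(struct demo_var *v)
{
    v->scale  = v->scale * 1.8;
    v->offset = v->offset * 1.8 + 32.0;
}
```

     For example, with scale 0.5 and offset -10, a stored value of 60
     reads as 20 degrees Celsius; after the conversion the same stored
     value reads as 68 degrees Fahrenheit.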



     Dimension Coordinates

     Applications may use the coord, scale,  and  offset  builtin
     dimension  attributes  to  relate different dimensions.  For
     example, if two dimensions have the  same  coord  attribute,
     applications  may  choose  to decide that the two dimensions
     are parallel.  The scale and offset attribute  can  then  be
     used  to  determine the exact correspondence between the two
     dimensions, assuming that correspondence is linear.

     Coordinate   types   GP_X_COORD,   GP_Y_COORD,   GP_Z_COORD,
     GP_TIME_COORD,  and  GP_NO_COORD are defined in include/gp.h
     for this purpose.  Applications are not restricted to  these
     coordinate types.
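
     One way an application might use these attributes is sketched
     below, assuming the rule given earlier for dimensions (orig
     index = index * scale + offset).  The structure, the function
     names, and the policy of treating equal coord values as
     "parallel" are illustrative assumptions, not TDF routines.

```c
#include <assert.h>

/* Hypothetical subset of the built-in dimension attributes. */
struct demo_dim_map {
    int    coord;    /* coordinate type code, e.g. a GP_X_COORD-style value */
    double scale;    /* orig index = index * scale + offset */
    double offset;
};

/* Position along the original (unsubsetted) axis for a given index. */
double demo_orig_index(const struct demo_dim_map *d, double index)
{
    return index * d->scale + d->offset;
}

/* Map an index along `from` to the matching, possibly fractional, index
 * along `to`.  Returns 0 on success, or -1 if the two dimensions do not
 * share a coordinate type and so are not treated as parallel. */
int demo_map_index(const struct demo_dim_map *from,
                   const struct demo_dim_map *to,
                   double index, double *result)
{
    assert(from != 0 && to != 0 && result != 0);
    assert(to->scale != 0.0);
    if (from->coord != to->coord)
        return -1;
    *result = (demo_orig_index(from, index) - to->offset) / to->scale;
    return 0;
}
```

     For instance, index 3 of a dimension with scale 4 and offset 100
     sits at original index 112, which is index 56 of a parallel
     dimension with scale 2 and offset 0.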

     Unlimited (Growing) Dimensions

     Unlimited  dimensions  can  be  defined  using  a  size   of
     GP_UNLIMITED,  found  in include/gp.h.  The following guide-
     lines apply when working with datasets with unlimited dimen-
     sions:

     -   Only one dimension in a dataset can be growing; defining
         a  second  unlimited  dimension will fix the size of the
         former growing dimension.

     -   If a variable is defined with a growing dimension,  that
         dimension must be the variable's leading dimension.

     -   All variables to be defined  with  an unlimited  leading
         dimension  must  be  defined  prior  to writing any data
         corresponding to that dimension. The size of the unlimited
         dimension will be fixed at the point where the new
         variable is defined.

     Cloning Objects

     Cloning an object refers to the process of creating  a  like
     object with the same attributes, optionally with a new name.
     When a variable is cloned, the new variable is created  with
     the  same  named dimensions.  These dimensions must exist in
     the output dataset, but do not have to have the  same  sizes
     as  the  corresponding  dimensions of the original variable.
     Similarly, when a relation is cloned, the  new  relation  is
     created, associating the same named variables.

     When a dimension is cloned, its corresponding  variable  (if
     one  is  defined)  is not carried over to the new dimension.
     This would present a chicken and egg problem, because the
     dimension could not be created without the variable, and the
     variable could not be created without the dimension.




     Definitions vs. Variable Data

     Everything about a dataset with the  exception  of  variable
     data  is  maintained  in virtual memory until the dataset is
     closed or synced.  If a dataset is opened  for  read  access
     and then is closed, nothing is written to disk regardless of
     whether the application changed attribute values or  defined
     new objects.

     If a dataset is opened with write access and then is closed,
     all  object  definitions  and  attributes are saved to disk.
     Saving definition and attribute changes can be suppressed by
     aborting the dataset rather than closing it.

     However, changes to variable data occur at the whim  of  the
     underlying  file system.  Variable data is not maintained in
     virtual memory, but is written directly to the file system.
     Aborting  a  dataset in the midst of writing variable
     data will  leave  the  dataset  in  an  undefined,  probably
     unreadable state.

     TeraScan datasets support random hypercube access to variable
     data.  A hypercube is defined by a starting 0-relative
     coordinate (i1,i2,...) and a cube size (n1,n2,...).  Variable
     indexing  is  similar to array indexing under C; i.e.,
     the index of the last dimension is the fastest moving.
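
     The indexing rule can be illustrated with a small sketch that
     extracts a hypercube from an array already held in memory.  It
     is not the TDF read routine (see gpio(3) for that); the function
     names are hypothetical, and DEMO_MAX_DIMS merely mirrors the
     example limit GP_VAR_DIMS = 5 quoted earlier.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DIMS 5   /* assumption: mirrors the GP_VAR_DIMS example */

/* Linear element offset of a 0-relative coordinate in a C-ordered array,
 * where the last dimension varies fastest. */
size_t demo_linear_offset(const size_t *dimsize, const size_t *index, int ndims)
{
    size_t off = 0;
    int d;
    for (d = 0; d < ndims; d++) {
        assert(index[d] < dimsize[d]);
        off = off * dimsize[d] + index[d];
    }
    return off;
}

/* Copy the hypercube with corner start[] and extent count[] from `data`
 * into the dense buffer `out`.  Returns the number of elements copied. */
size_t demo_read_hypercube(const double *data, const size_t *dimsize,
                           const size_t *start, const size_t *count,
                           int ndims, double *out)
{
    size_t idx[DEMO_MAX_DIMS], n = 0;
    int d;
    assert(ndims > 0 && ndims <= DEMO_MAX_DIMS);
    for (d = 0; d < ndims; d++) {
        if (count[d] == 0)
            return 0;
        assert(start[d] + count[d] <= dimsize[d]);
        idx[d] = start[d];
    }
    for (;;) {
        out[n++] = data[demo_linear_offset(dimsize, idx, ndims)];
        /* Advance like an odometer, last dimension first. */
        for (d = ndims - 1; d >= 0; d--) {
            if (++idx[d] < start[d] + count[d])
                break;
            idx[d] = start[d];
        }
        if (d < 0)
            return n;   /* every dimension wrapped: the cube is complete */
    }
}
```

     For a 3 x 4 array, the hypercube starting at (1,1) with size
     (2,2) covers linear offsets 5, 6, 9, and 10, in that order.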

     Link Subsets and Assemblies

     Any array data that can support random hypercube access  can
     be  linked to a TDF variable.  For example, data for a vari-
     able or variable hypercube in one TeraScan  dataset  can  be
     linked  to  a  variable  in  another  (or the same) TeraScan
     dataset.  This link mechanism allows data from one  or  more
     datasets to be linked to a single dataset without instantiation,
     i.e., without moving any data around.

     The following TeraScan applications take advantage  of  this
     link mechanism.

     subset          Creates a variable and/or  dimension  subset
                     of input datasets.

     assemble        Gathers  selected   variables   from   input
                     datasets into a single output dataset.

     burst           Slices variables along any dimension, creat-
                     ing link variables for each of the slices.

     impbin          Imports structured array data  from  non-TDF
                     files.




     This link mechanism is similar  to  the  UNIX  facility  for
     creating  symbolic  file links.  One drawback of using links
     is that links can be orphaned.  If data in file X is  linked
     to  a  variable  V in dataset A, and then X is removed, then
     the link variable V is orphaned.

     As a special case, a NULL file can be linked to a TDF variable.
     In this case, all stored values for the variable are
     assumed to be 0.

     Automatic Uncompression

     Datasets that have been compressed using the  UNIX  compress
     function  can  be  uncompressed  automatically  by TeraScan.
     TeraScan uses the UNIX zcat function to uncompress datasets,
     redirecting  the  output to the scratch directory defined by
     the environment variable UNCOMPRESSDIR.  If UNCOMPRESSDIR is
     undefined, uncompression is not attempted.

     A list of automatically uncompressed files is  kept  in  the
     Registry  file in the UNCOMPRESSDIR.  This file is ASCII but
     is not  intended  to  be  edited.   For  each  automatically
     uncompressed  file, the following information is shown: true
     path name of original, full path name of uncompressed  copy,
     last  modification  time of original in seconds, and the max
     idle time in seconds.

     Idle time is defined to be the difference between the current
     time and the last access time of the original.  The environ-
     ment variable UNCOMPRESSIDLE specifies the maximum idle time
     in   minutes   for  automatically  uncompressed  files.   If
     UNCOMPRESSIDLE is not set, the maximum idle time is  assumed
     to  be  60 minutes.  Different files can have different max-
     imum idle times.
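
     Because UNCOMPRESSIDLE is given in minutes while the Registry
     records the max idle time in seconds, a conversion is implied.
     The sketch below shows one plausible way to read such a setting
     with its 60-minute default.  The helper names are hypothetical,
     and falling back to the default on a malformed or non-positive
     value is an assumption; this page does not say how TeraScan
     treats bad input.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a positive decimal integer, returning `fallback` when the text is
 * missing, empty, malformed, or not greater than zero (assumed policy). */
long demo_parse_positive(const char *text, long fallback)
{
    char *end;
    long value;
    if (text == NULL || *text == '\0')
        return fallback;
    value = strtol(text, &end, 10);
    if (*end != '\0' || value <= 0)
        return fallback;
    return value;
}

/* Read a limit expressed in minutes from the environment and return it in
 * seconds, the unit used for the max idle time in the Registry. */
long demo_env_minutes_as_seconds(const char *name, long default_minutes)
{
    assert(name != NULL && default_minutes > 0);
    return demo_parse_positive(getenv(name), default_minutes) * 60L;
}
```

     A caller would use demo_env_minutes_as_seconds("UNCOMPRESSIDLE",
     60) to obtain the limit applied to a newly uncompressed file.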

     The environment variable UNCOMPRESSMAX specifies the maximum
     space  in megabytes to be allocated in the UNCOMPRESSDIR for
     automatically uncompressed files.  If UNCOMPRESSMAX  is  not
     set,  the  maximum is assumed to be 10 megabytes.  This max-
     imum is only a  rough  limit;  see  the  algorithm  outlined
     below:

          Given input compressed file F

          If UNCOMPRESSDIR is not defined, can't uncompress F

          If F is in Registry, F's last modification time matches
          what's in the Registry, and F's uncompressed copy still
          exists, use it

          Delete all entries in Registry if  original  no  longer
          exists,  original's  last  modification  time  does not
          match Registry, uncompressed copy does  not  exist,  or
          idle  time  (e.g.,  current  time - last access time of
          original) exceeds the max idle time

          While the total space occupied by  uncompressed  copies
          plus the size of F (not its uncompressed copy!) exceeds
          UNCOMPRESSMAX, delete the entry in Registry closest  to
          exceeding its max idle time

          Uncompress F and put it in the  Registry,  setting  its
          max idle time to UNCOMPRESSIDLE.
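
     The purge and make-room steps of this algorithm can be sketched
     over an in-memory table.  Everything below is a hypothetical
     model: the structure fields, the function names, and the reading
     of "closest to exceeding its max idle time" as the smallest
     remaining margin (max idle time minus idle time).  The real
     Registry is an ASCII file whose exact layout is not given here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory image of one Registry entry. */
struct demo_entry {
    int  in_use;     /* 0 once the entry has been deleted */
    long copy_size;  /* space occupied by the uncompressed copy */
    long idle;       /* current time - last access time of original, s */
    long max_idle;   /* per-entry limit, s */
    int  stale;      /* original gone or modified, or copy missing */
};

/* Delete entries that are stale or have been idle too long. */
void demo_purge(struct demo_entry *reg, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (reg[i].in_use && (reg[i].stale || reg[i].idle > reg[i].max_idle))
            reg[i].in_use = 0;
}

/* Total space held by the remaining uncompressed copies. */
long demo_space_used(const struct demo_entry *reg, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        if (reg[i].in_use)
            total += reg[i].copy_size;
    return total;
}

/* While the space in use plus the size of the compressed file F exceeds
 * the limit, delete the entry with the least idle margin left.  Returns
 * the number of entries deleted. */
int demo_make_room(struct demo_entry *reg, size_t n, long f_size, long limit)
{
    int evicted = 0;
    assert(f_size >= 0 && limit >= 0);
    while (demo_space_used(reg, n) + f_size > limit) {
        size_t i, victim = n;
        for (i = 0; i < n; i++) {
            if (!reg[i].in_use)
                continue;
            if (victim == n ||
                reg[i].max_idle - reg[i].idle <
                reg[victim].max_idle - reg[victim].idle)
                victim = i;
        }
        if (victim == n)
            break;   /* nothing left to delete; F alone exceeds the limit */
        reg[victim].in_use = 0;
        evicted++;
    }
    return evicted;
}
```

     Since the test uses the compressed size of F rather than the
     size of its uncompressed copy, the space actually occupied can
     end up above the limit, which is why the text calls it rough.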

     Hard Limits

     There are  currently  only  two  hard  limits  for  TeraScan
     datasets: length of names and number of variable dimensions.
     The name length limit applies not only to names, but to such
     built-in  attributes as var->units and rel->kind.  Arbitrary
     name lengths were not implemented for the following reasons:

     -   Applications are invariably written assuming  a  maximum
         name length, which may as well be constant across custo-
         mer sites.

     -   If names  have  unlimited  length,  built-in  attributes
         var->units  and  rel->kind  also  would  have  unlimited
         length.

     -   Unlimited length names mean more extensive use  of  mal-
         loc, which has been avoided.

     Error Handling

     Pipeline processing applications, interactive display appli-
     cations,  and  application  subsystems (e.g., TeraScan earth
     transform) have very different error handling requirements:

     -   Pipeline processing applications typically take  a  very
         brutal approach to errors; i.e., abort!

     -   Interactive display applications must always return control
         to the user, even on such show-stopping errors as
         running out of disk space or memory

     -   Application subsystems must always return control to the
         application  after  converting  lower  level error codes
         into higher level  ones  (e.g.,  no  such  attribute  =>
         dataset does not have earth location).

     In order to support  these  different  cases,  a  switchable
     error handler is used by all the dataset interface routines.
     (See CALLING SEQUENCES.) An application subsystem can switch
     its  own  error  handler  in  and out several times while an
     application is running.
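
     A switchable handler of this kind is commonly built on a
     function pointer.  The sketch below shows the idea only: every
     name in it is hypothetical, the error codes are invented for the
     example, and the actual calling sequences are documented in
     gperr(3) rather than here.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*demo_handler)(int code);

/* Stand-in for the TeraScan global variable terrno. */
int demo_terrno = 0;

/* Invented codes for the example; real codes live in include/gp.h. */
enum { DEMO_NO_SUCH_ATT = 1, DEMO_NO_EARTH_LOCATION = 100 };

/* Default behaviour: simply record the error code. */
void demo_default_handler(int code)
{
    demo_terrno = code;
}

static demo_handler demo_current = demo_default_handler;

/* Install a new handler and return the previous one so the caller can
 * switch it back in later. */
demo_handler demo_switch_handler(demo_handler h)
{
    demo_handler old = demo_current;
    assert(h != NULL);
    demo_current = h;
    return old;
}

/* What an interface routine would call when it detects an error. */
void demo_raise(int code)
{
    demo_current(code);
}

/* Subsystem handler: convert a lower level code into a higher level one
 * and return control to the application. */
void demo_subsystem_handler(int code)
{
    demo_terrno = (code == DEMO_NO_SUCH_ATT) ? DEMO_NO_EARTH_LOCATION : code;
}
```

     A subsystem would call demo_switch_handler() on entry, keep the
     returned pointer, and pass it back to demo_switch_handler() on
     exit, so the application's own policy is restored each time.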

     The default error handler simply sets  the  TeraScan  global
     variable  terrno  to the appropriate error code.  In addition
     to UNIX file open and memory allocation errors, the  follow-
     ing  errors  may  be  encountered.   These  error  codes are
     defined in include/gp.h.

FILES
     include/gp.h,         lib/libcdf.a,          lib/libutils.a,
     /usr/include/errno.h

SEE ALSO
     gpatt(3), gpdim(3), gperr(3), gpio(3), gplink(3), gpname(3),
     gprel(3),   gpset(3),   gptype(3),   gpvar(3),   dirfile(3),
     misc(3),   terrno(3),    open(2),    close(2),    malloc(3),
     compress(1)

     One of the strong points of the TDF and its programming
     interface is that applications do not depend on the
     physical  layout  of data on disk.  The physical layout of a
     typical dataset is as follows:

          - dataset header of 644 bytes (historical)
          - data for non-link variables
          - file descriptions for link variables
          - dataset attributes
          - dimension descriptions
          - variable descriptions and attributes
          - relation descriptions and attributes

     The start of data for a given variable is defined by
     var->datastart.  Data for non-link variables is guaranteed
     either to be completely contiguous or row-wise contiguous.
     The ith row of array A is defined to be all elements of A
     with leading index i.  The distance between rows is
     var->dimdist[0].


















