Re: [netcdfgroup] storing sparse matrices data in NetCDF

Hi Ken

I am not a Python programmer.
This is not Python code.
Just an algorithm outline.
Not tested, not guaranteed, just an idea.

1) Read the CSV file into a list of tuples [(ID,x,y,lon,lat,elev)].
That is, read just the static information that doesn't vary with time (the
first 6 lines), which correspond to the geographic location information only.
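A minimal sketch of this step in Python (the function name is mine, and the
layout -- first six lines as header rows, one static field per line with a
row label in column 0 -- is an assumption inferred from the sample file):

```python
import csv

# Hypothetical reader for step 1, assuming the first 6 lines of the
# CSV hold the static per-station fields (ID, x, y, lon, lat, elev),
# one station per column, with a row label in the first column.
def read_static(f, n_header=6):
    reader = csv.reader(f)
    rows = [next(reader) for _ in range(n_header)]
    # Transpose the header rows (minus the row labels) so each tuple
    # describes one station: (ID, x, y, lon, lat, elev).
    return [tuple(col) for col in zip(*(r[1:] for r in rows))]
```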

2) Sort the list of tuples above (using Python's "sorted") by primary key
lat and secondary key lon.
Let's say, in ascending order.
See these links:
https://docs.python.org/3/howto/sorting.html#
https://docs.python.org/3/howto/sorting.html#ascending-and-descending
https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts

You could add "elev" as a sort key, if you want to make "elev" another
array/coordinate variable,
in which case the primary key would be elev, the secondary key lat, and the
tertiary key lon.
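For example, the sort could look like this (the station tuples are made up;
the tuple layout is (ID, x, y, lon, lat, elev) as in step 1):

```python
# Sort station tuples by primary key lat and secondary key lon,
# ascending, using a tuple key as in the Python sorting HOWTO.
stations = [
    ("2", "20", "40", -27.8, 83.4, 200.0),  # (ID, x, y, lon, lat, elev)
    ("1", "10", "30", -27.9, 83.5, 100.0),
]
by_lat_lon = sorted(stations, key=lambda s: (s[4], s[3]))
# With elev as an extra coordinate, make it the primary key:
by_elev_lat_lon = sorted(stations, key=lambda s: (s[5], s[4], s[3]))
```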

3) Extract from the sorted list of tuples an array (or list) of [lon], then
weed out any repeated values.
The final length of [lon] is the lon dimension.
A list comprehension can perform the extraction, and NumPy's "numpy.unique"
can weed out the repeated values (it also sorts them).
Do the same with [lat], and optionally with [elev].
These will give you the arrays/coordinate variables lat(lat) and lon(lon)
in ascending order (and maybe elev(elev) too).
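A sketch of the extraction and weeding (the station tuples are made up and
shortened to (ID, lon, lat) for brevity):

```python
import numpy as np

# numpy.unique both removes repeated values and returns the result
# sorted ascending -- exactly what a coordinate variable needs.
stations = [("1", -27.9, 83.5), ("2", -27.8, 83.5), ("3", -27.9, 83.4)]
lons = np.unique([s[1] for s in stations])  # coordinate variable lon(lon)
lats = np.unique([s[2] for s in stations])  # coordinate variable lat(lat)
```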

4) Create an empty time(time) coordinate variable as well (with length
equal to the number of time records).

5) Create an empty runoff(time,lat,lon) 3D array (or
runoff(time,elev,lat,lon) 4D array)
and initialize it with _FillValue / missing_value. (Something way off the
range of possible runoffs will work,
but netCDF has default _FillValues for float, double, etc.)
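Steps 4) and 5) together, sketched with NumPy (the sizes are made up;
9.96921e+36 is the netCDF default float fill value):

```python
import numpy as np

ntime, nlat, nlon = 5, 3, 4            # illustration only
FILL = 9.96921e+36                     # netCDF default float _FillValue
time = np.arange(ntime)                # e.g. days since 1980-01-01
runoff = np.full((ntime, nlat, nlon), FILL, dtype="f4")
```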


6) Then

    itime = 0
    for each time record (i.e. line) in the CSV file, skipping the static
    information (lines 1-6) and reading one line/time_record per step into
    a list [(time, ID_A, runoff)]:
        Assign time(itime) = itime (if your time units are "days since
        1980-01-01", this starts at 0).
        for each ID_A in this list [(time, ID_A, runoff)] (which is sorted
        by ID_A, not by (lat,lon) or by (elev,lat,lon)):
            Search for that ID_A in the list [(ID,x,y,lon,lat,elev)]
            (which is sorted by (lat,lon)).
            When ID_A is found, assign the runoff array/variable value:
            runoff(time,lat,lon) = runoff(ID=ID_A)
            (Or runoff(time,elev,lat,lon) = runoff(ID=ID_A) if you decided
            to add elev as a coordinate.)
        itime = itime + 1
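A runnable sketch of this step (the stations and records are made up). One
substitution of mine: instead of re-searching the sorted list for every
ID_A, it builds a dict mapping ID -> (lat index, lon index) once, so each
assignment is a constant-time lookup:

```python
import numpy as np

stations = {"1": (-27.9, 83.5), "2": (-27.8, 83.4)}  # ID -> (lon, lat)
lons = np.unique([v[0] for v in stations.values()])
lats = np.unique([v[1] for v in stations.values()])
# The coordinate arrays are sorted, so searchsorted returns the exact
# index of each station's lat and lon.
idx = {k: (int(np.searchsorted(lats, v[1])), int(np.searchsorted(lons, v[0])))
       for k, v in stations.items()}

FILL = 9.96921e+36
time_records = [[("1", 0.023), ("2", 0.010)],   # itime = 0
                [("1", 0.024), ("2", 0.013)]]   # itime = 1
runoff = np.full((len(time_records), len(lats), len(lons)), FILL)
for itime, records in enumerate(time_records):
    for sid, value in records:
        ilat, ilon = idx[sid]
        runoff[itime, ilat, ilon] = value
```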

7) Create the netCDF file using the time, lat, lon (and optionally elev)
dimensions, the coordinate variables time(time), lat(lat), lon(lon)
(and elev(elev)), and the runoff(time,lat,lon) (or runoff(time,elev,lat,lon))
variable obtained above.

The runoff variable will be gridded, although the grid spacing probably
won't be uniform or regular.
You're likely to have a large number of data points with
_FillValue/missing_value,
but all genuine data will be represented.

You can also create variables x(lat,lon) and y(lat,lon), if you want to
preserve this information in the netCDF file as well.
Create them empty, initialize with _FillValue/missing_value, and assign
values by ID with a search like the one in step 6), except that you don't
need to loop over time records.

Gus Correa

On Mon, Mar 18, 2019 at 7:25 PM Ken Mankoff <mankoff@xxxxxxxxx> wrote:

>
> I could always grid the data - I have 20,000 outlets around the coast of
> Greenland at 90 m resolution. The total number of cells is 500 million.
> This seems inefficient, but the 499,980,000 empty cells do compress quite
> well. It is ~15 MB/day (5 GB/year). Maybe compressing multiple days
> improves on this? 5 GB/year is tractable for 40 years of data.
>
> But as someone who works with NetCDF, wouldn't this data presentation make
> you cringe? On the other hand, NetCDF works best when gridded, right?
>
>   -k.
>
>
> On 2019-03-18 at 16:19 -0700, Gus Correa <gus@xxxxxxxxxxxxxxxxx> wrote...
> > Hi Ken
> >
> > That is true.
> > I suppose both CDO and NCO (ncks) assume the lat and lon are
> > monotonic (increasing or decreasing) coordinate variables, and that
> > runoff has (time,lat,lon) dimensions, not (time,ID).
> > ID is not a coordinate, it is just a label for your observation
> stations, I
> > guess.
> >
> > You could devise a more elaborate scheme to define lat and lon
> dimensions,
> > then,
> > lat(lat) and lon(lon) coordinate variables, and from there create a 3D
> > runoff(time,lat,lon) variable.
> > There are several hurdles though:
> > 1) The values of lat and lon in your CSV file may have repetitions (this
> > affects the actual length of each dimension, which may be <20000).
> > 2) The values of lat and lon in your CSV file may not be monotonically
> > ordered (either in increasing or decreasing order).
> > I didn't spot any repetitions in the sample file you sent (but the full
> > file may have repetitions),
> > but the lat and lon are definitely not monotonically ordered,
> > they can go up, then down, then up again ...
> > Bona fide coordinate variables must be monotonic.
> > 3) Even if you weed out repetitions in lat or lon, and sort them in
> > increasing or decreasing order,
> > you would have to exchange also the corresponding runoff values, so that
> > they continue to belong to the correct station/location/ID,
> > i.e. sort the whole file with (lat,lon) as primary and secondary keys.
> >
> > Maybe Python has a sort routine that does all that for you gracefully,
> some
> > variant of qsort perhaps.
> >
> > Gus
> >
> >
> > On Mon, Mar 18, 2019 at 6:50 PM Ken Mankoff <mankoff@xxxxxxxxx> wrote:
> >
> >> Hi Sourish, Gus, and Elizabeth,
> >>
> >> Thank you all for your suggestions. I think I've found something that
> >> works, except for one issue. Please excuse my likely incorrect use of
> >> terminology - being new to NetCDF creation I may say something
> incorrect,
> >> but I hope the data dump below speaks for itself.
> >>
> >> Because my data is 2D (time, ID), then those are the dimensions, and
> >> lon,lat,x,y become variables on the ID dimension. This means my standard
> >> netcdf tools for slicing based on spatial dimension don't work. For
> example,
> >>
> >> cdo sellonlatbox,83.5,85,-27,-28 ds.nc bar.nc
> >>
> >> or
> >>
> >> ncks -d lat,83.5,85 -d lon,-27,-28 ds.nc bar.nc
> >> # ncks: ERROR dimension lat is not in input file
> >>
> >> Is there a way to make the data 2D but have the 2nd dimension be
> >> (lon,lat)? Even if yes, I don't imagine the cdo and ncks tools would
> work
> >> on that dimension... Is there a cdo, nco, or ncks (or other) simple tool
> >> I'm missing that can work with this non-gridded data the way those
> tools do
> >> so easily work with gridded data?
> >>
> >>
> >> Anyway, here is the Python xarray code I got working to produce the NetCDF
> NetCDF
> >> file, reading in the 'foo.csv' from my previous email and generating
> ds.nc.
> >> Once I understood the NetCDF structure from the file Sourish provided, I
> >> was able to generate something similar using a higher level API - one
> that
> >> takes care of time units, calendar, etc. I leave out (x,y,elev) for
> brevity.
> >>
> >>
> >>   -k.
> >>
> >>
> >>
> >> df = pd.read_csv('foo.csv', index_col=0, header=[0,1,2,3,4,5])
> >> df.index = pd.to_datetime(df.index)
> >>
> >> # Build the dataset
> >> ds = xr.Dataset()
> >> ds['lon'] = (('ID'), df.columns.get_level_values('lon'))
> >> ds['lat'] = (('ID'), df.columns.get_level_values('lat'))
> >> ds['runoff'] = (('time', 'ID'), df.values)
> >> ds['ID'] = df.columns.get_level_values('ID')
> >> ds['time'] = df.index
> >>
> >> # Add metadata
> >> ds['lon'].attrs['units'] = 'Degrees East'
> >> ds['lon'].attrs['long_name'] = 'Longitude'
> >> ds['lat'].attrs['units'] = 'Degrees North'
> >> ds['lat'].attrs['long_name'] = 'Latitude'
> >> ds['runoff'].attrs['units'] = 'm^3/day'
> >> ds['ID'].attrs['long_name'] = 'Basin ID'
> >>
> >> ds.to_netcdf('ds.nc')
> >>
> >>
> >>
> >>
> >> And here is the ncdump of the file
> >>
> >>
> >>
> >>
> >>
> >> netcdf ds {
> >> dimensions:
> >>         ID = 10 ;
> >>         time = 5 ;
> >> variables:
> >>         string lon(ID) ;
> >>                 lon:units = "Degrees East" ;
> >>                 lon:long_name = "Longitude" ;
> >>         string lat(ID) ;
> >>                 lat:units = "Degrees North" ;
> >>                 lat:long_name = "Latitude" ;
> >>         double runoff(time, ID) ;
> >>                 runoff:_FillValue = NaN ;
> >>                 runoff:units = "m^3/day" ;
> >>                 runoff:long_name = "RACMO runoff" ;
> >>         string ID(ID) ;
> >>                 ID:long_name = "Basin ID" ;
> >>         int64 time(time) ;
> >>                 time:units = "days since 1980-01-01 00:00:00" ;
> >>                 time:calendar = "proleptic_gregorian" ;
> >>
> >> // global attributes:
> >>                 :Creator = "Ken Mankoff" ;
> >>                 :Contact = "kdm@xxxxxxx" ;
> >>                 :Institution = "GEUS" ;
> >>                 :Version = 0.1 ;
> >> data:
> >>
> >>  lon = "-27.983", "-27.927", "-27.894", "-28.065", "-28.093", "-28.106",
> >>     "-28.155", "-27.807", "-27.455", "-27.914" ;
> >>
> >>  lat = "83.505", "83.503", "83.501", "83.502", "83.501", "83.499",
> >> "83.498",
> >>     "83.485", "83.471", "83.485" ;
> >>
> >>  runoff =
> >>   0.023, 0.01, 0.023, 0.005, 0, 0, 0, 0, 0, 0,
> >>   0.023, 0.01, 0.023, 0.005, 0, 0, 0, 0, 0, 0,
> >>   0.024, 0.013, 0.023, 0.005, 0, 0, 0, 0, 0, 0,
> >>   0.025, 0.012, 0.023, 0.005, 0, 42, 0, 0, 0, 0,
> >>   0.023, 0.005, 0.023, 0.005, 0, 0, 0, 0, 0, 0 ;
> >>
> >>  ID = "1", "2", "5", "8", "9", "10", "12", "13", "15", "16" ;
> >>
> >>  time = 0, 1, 2, 3, 4 ;
> >> }
> >>
> >> _______________________________________________
> >> NOTE: All exchanges posted to Unidata maintained email lists are
> >> recorded in the Unidata inquiry tracking system and made publicly
> >> available through the web.  Users who post to any of the lists we
> >> maintain are reminded to remove any personal information that they
> >> do not want to be made public.
> >>
> >>
> >> netcdfgroup mailing list
> >> netcdfgroup@xxxxxxxxxxxxxxxx
> >> For list information or to unsubscribe,  visit:
> >> http://www.unidata.ucar.edu/mailing_lists/
> >>
>
>