Re: [netcdfgroup] storing sparse matrices data in NetCDF

  • Subject: Re: [netcdfgroup] storing sparse matrices data in NetCDF
  • From: Sourish Basu <Sourish.Basu@xxxxxxxxxxxx>
  • Date: Mon, 18 Mar 2019 19:58:40 -0600
Ken,

As someone who works with netcdf every day: if I were a user, I wouldn't
want you to grid your data. Your data are basically 20,000 different
time series at 20,000 fixed lat/lon/elev locations. It's not inherently
on a grid, and there's no advantage to be had in gridding it into a
mostly empty hypercube. I think netcdf is very well suited for these
sorts of data as well, and in my work we exchange atmospheric
observation data -- also time series at discrete locations and times --
as netcdf files all the time. Given the simple data structure you've
gone for, it's easy enough for a user to write their own subsetting
script should they want to.
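
For instance, here is a minimal, untested sketch of such a script using
xarray, assuming the variable layout from your dump below (where lon/lat
come out as strings, hence the casts):

import numpy as np
import xarray as xr

ds = xr.open_dataset('ds.nc')
# lon/lat live on the ID dimension (stored as strings in this file),
# so cast them and build a mask along ID
lat = ds['lat'].astype(float)
lon = ds['lon'].astype(float)
box = (lat >= 83.5) & (lat <= 85) & (lon >= -28) & (lon <= -27)
# positional indexing keeps only the outlets inside the box
subset = ds.isel(ID=np.flatnonzero(box.values))
subset.to_netcdf('subset.nc')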

It's true that you see a lot of discussion in this forum about handling
gridded netcdf data with NCO and viewing them with panoply/ncview, but I
think that's just because netcdf tends to get used heavily in the world
of modelers -- be it land surface, atmosphere or ocean -- and models
typically output gridded fields. I personally wish more people with data
like yours would consider netcdf, so I'm really happy that you are :-)

Cheers,

Sourish

On 3/18/19 5:25 PM, Ken Mankoff wrote:
> I could always grid the data - I have 20,000 outlets around the coast of 
> Greenland at 90 m resolution. The total number of cells is 500 million. This 
> seems inefficient, but the 499,980,000 empty cells do compress quite well. It 
> is ~15 MB/day (5 GB/year). Maybe compressing multiple days improves on this? 
> 5 GB/year is tractable for 40 years of data.
>
> But as someone who works with NetCDF, wouldn't this data presentation make 
> you cringe? On the other hand, NetCDF works best when gridded, right?
>
>   -k.
>   
>
> On 2019-03-18 at 16:19 -0700, Gus Correa <gus@xxxxxxxxxxxxxxxxx> wrote...
>> Hi Ken
>>
>> That is true.
>> I suppose both CDO and NCO (ncks) assume the lat and lon are
>> monotonic (increasing or decreasing) coordinate variables, and that
>> runoff has (time,lat,lon) dimensions, not (time,ID).
>> ID is not a coordinate, it is just a label for your observation stations, I
>> guess.
>>
>> You could devise a more elaborate scheme: define lat and lon dimensions,
>> then lat(lat) and lon(lon) coordinate variables, and from there create a
>> 3D runoff(time,lat,lon) variable.
>> There are several hurdles though:
>> 1) The values of lat and lon in your CSV file may have repetitions (this
>> affects the actual length of each dimension, which may be <20000).
>> 2) The values of lat and lon in your CSV file may not be monotonically
>> ordered (either in increasing or decreasing order).
>> I didn't spot any repetitions in the sample file you sent (though the
>> full file may have some), but the lat and lon values are definitely not
>> monotonically ordered: they can go up, then down, then up again ...
>> Bona fide coordinate variables must be monotonic.
>> 3) Even if you weed out repetitions in lat and lon, and sort them in
>> increasing or decreasing order, you would also have to reorder the
>> corresponding runoff values so that they stay attached to the correct
>> station/location/ID, i.e. sort the whole file with (lat,lon) as primary
>> and secondary keys.
>>
>> Maybe Python has a sort routine that does all that for you gracefully, some
>> variant of qsort perhaps.
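>>
>> numpy's lexsort, for instance, does multi-key sorts; a rough, untested
>> sketch (keys are passed with the primary key last):
>>
>> import numpy as np
>> # lat, lon: 1D station arrays; runoff: shape (ntime, nstation)
>> order = np.lexsort((lon, lat))   # lat primary key, lon secondary
>> lat, lon = lat[order], lon[order]
>> runoff = runoff[:, order]        # keep values attached to their stations
>>
>> Weeding out repetitions would still be a separate step (numpy.unique
>> could help there).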
>>
>> Gus
>>
>>
>> On Mon, Mar 18, 2019 at 6:50 PM Ken Mankoff <mankoff@xxxxxxxxx> wrote:
>>
>>> Hi Sourish, Gus, and Elizabeth,
>>>
>>> Thank you all for your suggestions. I think I've found something that
>>> works, except for one issue. Please excuse my likely incorrect use of
>>> terminology - being new to NetCDF creation I may get some terms wrong,
>>> but I hope the data dump below speaks for itself.
>>>
>>> Because my data is 2D (time, ID), those are the dimensions, and
>>> lon,lat,x,y become variables on the ID dimension. This means my standard
>>> netcdf tools for slicing based on spatial dimension don't work. For example,
>>>
>>> cdo sellonlatbox,83.5,85,-27,-28 ds.nc bar.nc
>>>
>>> or
>>>
>>> ncks -d lat,83.5,85 -d lon,-27,-28 ds.nc bar.nc
>>> # ncks: ERROR dimension lat is not in input file
>>>
>>> Is there a way to make the data 2D but have the 2nd dimension be
>>> (lon,lat)? Even if yes, I don't imagine the cdo and ncks tools would work
>>> on that dimension... Is there a simple cdo, nco, or ncks (or other) tool
>>> I'm missing that can work with this non-gridded data as easily as those
>>> tools work with gridded data?
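>>>
>>> (One lead I haven't tested: the NCO docs describe auxiliary-coordinate
>>> hyperslabbing, something like
>>>
>>> ncks -X -28,-27,83.5,85 ds.nc bar.nc
>>>
>>> which is supposed to subset on lat/lon *variables* tagged with
>>> standard_name/units attributes rather than on dimensions. I don't know
>>> yet whether it applies to a layout like mine.)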
>>>
>>>
>>> Anyway, here is the Python xarray code I got working to produce the NetCDF
>>> file, reading in the 'foo.csv' from my previous email and generating ds.nc.
>>> Once I understood the NetCDF structure from the file Sourish provided, I
>>> was able to generate something similar using a higher level API - one that
>>> takes care of time units, calendar, etc. I leave out (x,y,elev) for brevity.
>>>
>>>
>>>   -k.
>>>
>>>
>>>
>>> import pandas as pd
>>> import xarray as xr
>>>
>>> df = pd.read_csv('foo.csv', index_col=0, header=[0,1,2,3,4,5])
>>> df.index = pd.to_datetime(df.index)
>>>
>>> # Build the dataset
>>> ds = xr.Dataset()
>>> ds['lon'] = (('ID'), df.columns.get_level_values('lon'))
>>> ds['lat'] = (('ID'), df.columns.get_level_values('lat'))
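>>> # note: get_level_values() on CSV headers yields strings, which is why
>>> # the dump below shows "string lon(ID)"; adding .astype(float) here
>>> # would store them as doubles instead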
>>> ds['runoff'] = (('time', 'ID'), df.values)
>>> ds['ID'] = df.columns.get_level_values('ID')
>>> ds['time'] = df.index
>>>
>>> # Add metadata
>>> ds['lon'].attrs['units'] = 'Degrees East'
>>> ds['lon'].attrs['long_name'] = 'Longitude'
>>> ds['lat'].attrs['units'] = 'Degrees North'
>>> ds['lat'].attrs['long_name'] = 'Latitude'
>>> ds['runoff'].attrs['units'] = 'm^3/day'
>>> ds['ID'].attrs['long_name'] = 'Basin ID'
>>>
>>> ds.to_netcdf('ds.nc')
>>>
>>>
>>>
>>>
>>> And here is the ncdump of the file
>>>
>>>
>>>
>>>
>>>
>>> netcdf ds {
>>> dimensions:
>>>         ID = 10 ;
>>>         time = 5 ;
>>> variables:
>>>         string lon(ID) ;
>>>                 lon:units = "Degrees East" ;
>>>                 lon:long_name = "Longitude" ;
>>>         string lat(ID) ;
>>>                 lat:units = "Degrees North" ;
>>>                 lat:long_name = "Latitude" ;
>>>         double runoff(time, ID) ;
>>>                 runoff:_FillValue = NaN ;
>>>                 runoff:units = "m^3/day" ;
>>>                 runoff:long_name = "RACMO runoff" ;
>>>         string ID(ID) ;
>>>                 ID:long_name = "Basin ID" ;
>>>         int64 time(time) ;
>>>                 time:units = "days since 1980-01-01 00:00:00" ;
>>>                 time:calendar = "proleptic_gregorian" ;
>>>
>>> // global attributes:
>>>                 :Creator = "Ken Mankoff" ;
>>>                 :Contact = "kdm@xxxxxxx" ;
>>>                 :Institution = "GEUS" ;
>>>                 :Version = 0.1 ;
>>> data:
>>>
>>>  lon = "-27.983", "-27.927", "-27.894", "-28.065", "-28.093", "-28.106",
>>>     "-28.155", "-27.807", "-27.455", "-27.914" ;
>>>
>>>  lat = "83.505", "83.503", "83.501", "83.502", "83.501", "83.499",
>>>     "83.498", "83.485", "83.471", "83.485" ;
>>>
>>>  runoff =
>>>   0.023, 0.01, 0.023, 0.005, 0, 0, 0, 0, 0, 0,
>>>   0.023, 0.01, 0.023, 0.005, 0, 0, 0, 0, 0, 0,
>>>   0.024, 0.013, 0.023, 0.005, 0, 0, 0, 0, 0, 0,
>>>   0.025, 0.012, 0.023, 0.005, 0, 42, 0, 0, 0, 0,
>>>   0.023, 0.005, 0.023, 0.005, 0, 0, 0, 0, 0, 0 ;
>>>
>>>  ID = "1", "2", "5", "8", "9", "10", "12", "13", "15", "16" ;
>>>
>>>  time = 0, 1, 2, 3, 4 ;
>>> }
>>>
