Unidata Developer's BlogUnidata Developer's Bloghttps://www.unidata.ucar.edu/blogs/developer/en/feed/entries/atom2019-06-24T06:22:59-06:00Apache Rollerhttps://www.unidata.ucar.edu/blogs/developer/entry/metpy-mondays-85-named-tuplesMetPy Mondays #85 - Named TuplesJohn Leeman2019-06-24T06:23:00-06:002019-06-24T06:23:00-06:00<p>This week discover the named tuple and where it can be used in your Python projects!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/uVEeBXzxYXM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
https://www.unidata.ucar.edu/blogs/developer/entry/metpy-mondays-84-fileinput-andMetPy Mondays #84 - fileinput and multiple filesJohn Leeman2019-06-17T06:00:00-06:002019-06-17T06:00:00-06:00<p>This week learn about the builtin fileinput module and how to use it to automatically parse across multiple files.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/WFFexMM5qdA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
https://www.unidata.ucar.edu/blogs/developer/entry/metpy-user-surveyMetPy User SurveyRyan May2019-06-11T12:45:27-06:002019-06-11T12:45:27-06:00<p>The MetPy development team would appreciate people taking the <a href="https://forms.gle/cJkpFx31SnG9JULb8" title="2019 MetPy User Survey">2019 MetPy User Survey</a>.</p>
<p>If you've used MetPy and have 5-10 minutes to spare, the MetPy development team would greatly appreciate it if you could take the <a href="https://forms.gle/cJkpFx31SnG9JULb8" title="2019 MetPy User Survey">2019 MetPy User Survey</a>. This will greatly help steer the direction of MetPy development and assist in reporting back to NSF on the progress of the project.</p>
https://www.unidata.ucar.edu/blogs/developer/entry/metpy-mondays-83-loggingMetPy Mondays #83 - LoggingJohn Leeman2019-06-10T07:02:09-06:002019-06-10T07:02:09-06:00<p>Stop putting debug print statements in your code with the logging module!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/5FEHUOXklCI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
https://www.unidata.ucar.edu/blogs/developer/entry/metpy-mondays-82-multiprocessingMetPy Mondays #82 - MultiprocessingJohn Leeman2019-06-03T05:35:33-06:002019-06-03T05:35:33-06:00<p>Multiprocessing is a quick way to speed up batch processing work. Learn the basics on this week's MetPy Monday!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/i_P98aBiTsg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
https://www.unidata.ucar.edu/blogs/developer/entry/metpy-mondays-81-managing-environmentsMetPy Mondays #81 - Managing EnvironmentsJohn Leeman2019-05-27T04:09:49-06:002019-05-27T04:09:49-06:00<p>Environments are a useful tool for your Python based life, but managing them can be tricky. Learn the ins and outs on this week's MetPy Monday!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/nzkJs-niH6Q" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
https://www.unidata.ucar.edu/blogs/developer/entry/chunking-algorithms-for-netcdf-cChunking Algorithms for NetCDF-CDennis Heimbigner 2019-05-22T14:14:54-06:002019-05-22T14:14:54-06:00<p>Unidata is in the process of developing a Zarr [] based variant
of netcdf. As part of this effort, it was necessary to
implement some support for chunking. Specifically, the problem
to be solved was that of extracting a hyperslab of data from an
n-dimensional variable (array in Zarr parlance) that has been divided
into chunks (in the HDF5 sense). Each chunk is stored independently
in the data storage -- Amazon S3, for example.</p>
<p>The algorithm takes a series of R slices of the form (first,stop,stride),
where R is the rank of the variable. Note that a slice of the form
(first, count, stride), as used by netcdf, is equivalent because
stop = first + count*stride. These slices form a hyperslab.</p>
<p>The goal is to compute the set of chunks that intersect the hyperslab
and to then extract the relevant data from that set of chunks to
produce the hyperslab.</p>
<h1>Introduction</h1>
<p>It appears from web searches that this algorithm is nowhere documented
in the form of a high level pseudo code algorithm. It appears to only
exist in the form of code in HDF5, Zarr, and probably TileDB, and maybe
elsewhere.</p>
<p>What follows is an attempt to reverse engineer the algorithm used by
the Zarr code to compute this intersection. This is intended to be
the bare-bones algorithm with no optimizations included. Thus, its
performance is probably not the best, but it should work in all cases.
Some understanding of how HDF5 chunking is used is probably essential
for understanding this algorithm.</p>
<p>The original Python code relies heavily on Python
iterators, which might be considered a mistake. Iterators
are generally most useful in two situations: (1) they make
the code clearer, which is arguably not so in this case, and (2) there is a reasonable
probability that the iteration will be terminated before it is exhausted, thus
improving efficiency. The latter is demonstrably false for the Zarr code.</p>
<p>This code instead uses the concept of an <em>odometer</em>,
which is a way to iterate over all the elements of an
n-dimensional object. Odometer code already exists in several
places in the existing netcdf-c library, among them
<em>ncdump/nciter.c</em>, <em>ncgen/odom.c</em>, and <em>libdap4/d4odom.c</em>. It
is also equivalent to the Python <em>itertools.product</em> iterator
function.</p>
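<p>As a small illustration of the equivalence mentioned above, the following sketch
walks a rank-2 index space with <em>itertools.product</em>. The ranges are
made-up values chosen only for the example.</p>

```python
import itertools

# Hypothetical per-dimension index ranges for a rank-2 walk.
ranges = [range(0, 2), range(0, 3)]

# itertools.product varies the last position fastest, which is the
# same order in which an odometer's wheels turn.
visited = list(itertools.product(*ranges))
print(visited)
```

<p>The rightmost wheel advances on every step and the leftmost wheel advances
only when everything to its right has wrapped around.</p>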
<p>This algorithm assumes the following items of input:</p>
<ol>
<li>variable (aka array) - multidimensional array of data values</li>
<li>dimensions - the vector of dimensions defining the rank of the array;
this set comes from the dimensions associated with the array.
So given v(d1,d2), where (in netcdf terms) d1 and d2 are the
dimension names and where, for example, d1=10 and d2=20,
the dimensions given to this algorithm are the ordered vector (d1,d2),
or equivalently (10,20).</li>
<li>chunk sizes - for each dimension in the dimension set, there is defined
a chunk size along that dimension. Thus, there is also an ordered vector
of chunk sizes, (c1,c2), corresponding to the dimension vector (d1,d2).</li>
<li>slices - a vector of slice instances defining the subset of data to be
extracted from the array. A single slice as used here is of the form
(start,stop,stride) where start is the starting position with respect to
a dimension, stop is the last position + 1 to extract, and stride is
the step between successive extracted positions. Note that this
is different from the netcdf-c nc_get_vars slices, which are of the form
(start,count,stride). The two are equivalent since stop = start+(count*stride).
When extracting data from a variable of rank R, one needs to specify R slices
where each slice corresponds to a dimension of the variable.</li>
</ol>
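<p>The relation stop = start+(count*stride) can be checked with a short Python
sketch. The helper names <em>vars_to_slice</em> and <em>slice_count</em> are
hypothetical and are not part of netcdf-c or Zarr.</p>

```python
def ceildiv(x, y):
    # Integer ceiling division for positive y.
    return -(-x // y)

def vars_to_slice(start, count, stride):
    # Convert an nc_get_vars style (start, count, stride) triple
    # into the (start, stop, stride) form used by this algorithm.
    return (start, start + count * stride, stride)

def slice_count(start, stop, stride):
    # Recover the number of selected positions from a slice.
    return max(0, ceildiv(stop - start, stride))

s = vars_to_slice(2, 4, 3)          # selects positions 2, 5, 8, 11
positions = list(range(*s))
print(s, positions, slice_count(*s))
```

<p>Note that the computed stop (14 here) can overshoot "last position + 1"
(12 here) by up to stride-1 without changing which positions are selected, so
an implementation would probably want to clamp it to the dimension length.</p>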
<p>At a high-level, the algorithm works by separately analyzing
each dimension of the array using the corresponding slice and
corresponding chunk size. The result is a set of <em>projections</em>
specific to each dimension. By taking the cross-product of these
projections, one gets a vector of projection-vectors that can be
evaluated to extract a subset of the desired data for storage in
an output array.</p>
<p>It is important to note that this algorithm operates
in two phases. In the first phase, it constructs the projections
for each dimension. In the second phase, these projections
are combined as a cross-product to provide subsetting for the
true chunks, which are R-dimensional rectangles.</p>
<p>What follows is the algorithm written as a set of pseudo-code procedures
that produce the final output given the above inputs.</p>
<h2>Notations:</h2>
<ul>
<li>floordiv(x,y) = floor((x / y))</li>
<li>ceildiv(x,y) = ceil((x / y))</li>
</ul>
<h2>Notes:</h2>
<ul>
<li>The ith slice is matched to the ith dimension for the variable</li>
<li>The ith slice is matched to the ith chunksize for the variable</li>
<li>The zarr code uses iterators, but this code converts to using
vectors and odometers for (one hopes) some clarity and consistency
with existing netcdf-c code.</li>
</ul>
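<p>For reference, the floordiv and ceildiv notations can be written with pure
integer arithmetic in Python. This is a sketch that assumes a positive
divisor; avoiding floating point sidesteps rounding problems for large values.</p>

```python
def floordiv(x, y):
    # Python's // already rounds toward negative infinity.
    return x // y

def ceildiv(x, y):
    # Negate, floor-divide, negate again: ceil(x / y) for y > 0.
    return -(-x // y)

print(floordiv(7, 2), ceildiv(7, 2), ceildiv(8, 2))
```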
<h2>Global Type Declarations</h2>
<pre><code>class Slice {
    int start;
    int stop;
    int stride;
}
class SliceIndex { // taken from zarr code see SliceDimIndexer
    int chunk0;  // index of the first chunk touched by the slice
    int nchunks; // number of chunks touched by this slice index
    int count;   // total number of output items defined by this slice index
    Projection projections[nchunks]; // There are multiple projections
                                     // derived from the original slice:
                                     // one for each chunk touched by
                                     // the original slice
}
class Projection {
    int chunkindex;
    Slice slice; // slice specific to this chunk
    int outpos;  // start point in the output to store the extracted data
}
</code></pre>
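<p>One possible rendering of these declarations in Python uses dataclasses.
The field names follow the pseudo-code; the default stride of 1 and the
sample values are assumptions made for the example.</p>

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Slice:
    start: int
    stop: int
    stride: int = 1  # assumed default, not stated in the pseudo-code

@dataclass
class Projection:
    chunkindex: int  # which chunk along this dimension
    slice: Slice     # slice specific to this chunk
    outpos: int      # start point in the output for this chunk's data

@dataclass
class SliceIndex:
    chunk0: int      # index of the first chunk touched by the slice
    nchunks: int     # number of chunks touched by the slice
    count: int       # total number of output items for the slice
    projections: List[Projection] = field(default_factory=list)

# Made-up instance: one touched chunk that contributes two output items.
example = SliceIndex(chunk0=0, nchunks=1, count=2,
                     projections=[Projection(0, Slice(1, 4, 2), 0)])
print(example)
```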
<h2>Global variables</h2>
<p>In order to keep argument lists short, certain values are
assumed to be globally defined and accessible.</p>
<ul>
<li>R - the rank of the variable</li>
<li>dimlen[R] - the length of the dimensions associated with the variable</li>
<li>chunklen[R] - the length of the chunk sizes associated with the variable</li>
<li>int zeros[R] - a vector of zeros</li>
<li>int ones[R] - a vector of ones</li>
</ul>
<h2>Procedure EvaluateSlices</h2>
<pre><code>// Goal: Given the projections for each slice being applied to the
// variable, create and walk all possible combinations of projection
// vectors that can be evaluated to provide the output data
void EvaluateSlices(
    Slice slices[R], // the complete set of slices
    T output         // the target storage for the extracted data: its type is T
    )
{
    int i;
    SliceIndex allsliceindices[R];
    Odometer odometer;
    int nchunks[R];      // the vector of nchunks from the slice indices
    int chunkindices[R]; // the combination currently chosen by the odometer
    // Compute the slice index vector
    allsliceindices = compute_all_slice_projections(slices);
    // Extract the nchunks vector
    for(i=0;i<R;i++) {
        nchunks[i] = allsliceindices[i].nchunks;
    }
    // Create an odometer with R wheels, where wheel[i] iterates over 0..nchunks[i]-1
    odometer = Odometer.new(R,zeros,nchunks);
    // iterate over the odometer: all combinations of chunk indices in the projections
    for(;odometer.more();odometer.next()) {
        chunkindices = odometer.indices();
        ApplyChunkIndices(chunkindices,output,allsliceindices);
    }
}
</code></pre>
<h2>Procedure ApplyChunkIndices</h2>
<pre><code>// Goal: given a vector of chunk indices from projections,
// extract the corresponding data and store it into the
// output target
void ApplyChunkIndices(
    int chunkindices[R], // indices chosen by the parent odometer
    T output,            // the target storage for the extracted data
    SliceIndex allsliceindices[R]
    )
{
    int i;
    Projection projections[R];
    Slice slices[R]; // the ith slice comes from the projection chosen for dimension i
    int outpos[R];   // capture the outpos values across the projections
    int outlen[R];   // extent of the output along each dimension
    int outputstart;
    // Construct a vector of slices of size R where the ith slice is taken
    // from the projection selected by the ith entry of chunkindices.
    // GetData then walks an odometer over these slices to extract
    // values and store them in the output.
    for(i=0;i<R;i++) {
        int chunkindex = chunkindices[i];
        projections[i] = allsliceindices[i].projections[chunkindex];
        slices[i] = projections[i].slice;
        outpos[i] = projections[i].outpos;
        outlen[i] = allsliceindices[i].count;
    }
    // Compute where the extracted data will go in the output vector;
    // outpos is relative to the output, so the output extents are used
    outputstart = computelinearoffset(R,outpos,outlen);
    // chunk (holding chunksize elements) is the stored chunk identified by
    // the chunkindex fields of the projections; reading it is not shown here
    GetData(slices,chunksize,chunk,outputstart,output);
}
</code></pre>
<h2>Procedure GetData</h2>
<pre><code>// Goal: given a set of slices taken from projections,
// extract the corresponding data from a chunk and store it
// into the output target.
void GetData(
    Slice slices[R],
    int chunksize,      // total # T instances in chunk
    T chunk[chunksize],
    int outputstart,
    T output
    )
{
    Odometer sliceodom;
    int chunkpos[R];
    sliceodom = Odometer.new(R, slices);
    // iterate over the odometer to get a point in the chunk space
    for(;sliceodom.more();sliceodom.next()) {
        chunkpos = sliceodom.indices(); // R-dimensional position within the chunk
        // the value at chunkpos is then copied into output, beginning at outputstart
    }
}
</code></pre>
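<p>To make the interplay of EvaluateSlices, ApplyChunkIndices, and GetData more
concrete, here is a minimal end-to-end sketch in Python. It is not the Zarr or
netcdf-c implementation. It assumes 0 &lt;= start &lt;= stop &lt;= dimlen and
stride &gt;= 1, stores every chunk as a full-size, row-major flat list, and
uses <em>itertools.product</em> in place of the odometers. The names
<em>projections</em>, <em>extract</em>, and <em>linear</em>, the tuple layout,
and the sample data are all hypothetical.</p>

```python
import itertools
from math import prod

def ceildiv(x, y):
    return -(-x // y)

def linear(index, shape):
    # Row-major linear offset of an index vector within a shape.
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset

def projections(dimlen, chunklen, start, stop, stride):
    # Phase 1, one dimension: (chunkindex, positions in chunk, outpos)
    # for every chunk that actually contains selected positions.
    result = []
    for c in range(start // chunklen, ceildiv(stop, chunklen)):
        offset = c * chunklen
        limit = min(dimlen, offset + chunklen)
        if start < offset:
            first = (stride - (offset - start) % stride) % stride
            outpos = ceildiv(offset - start, stride)
        else:
            first, outpos = start - offset, 0
        positions = range(first, min(stop, limit) - offset, stride)
        if len(positions) > 0:
            result.append((c, positions, outpos))
    return result

def extract(chunks, dimlen, chunklen, slices):
    per_dim = [projections(d, c, *s) for d, c, s in zip(dimlen, chunklen, slices)]
    outshape = [max(0, ceildiv(b - a, s)) for a, b, s in slices]
    output = [None] * prod(outshape)
    # Phase 2: each combination of per-dimension projections names one chunk.
    for combo in itertools.product(*per_dim):
        chunk = chunks[tuple(p[0] for p in combo)]
        # Walk the selected positions of this chunk; k counts items within
        # a projection, so outpos + k is the output index on that dimension.
        for picks in itertools.product(*(enumerate(p[1]) for p in combo)):
            src = linear([pos for _, pos in picks], chunklen)
            dst = linear([p[2] + k for p, (k, _) in zip(combo, picks)], outshape)
            output[dst] = chunk[src]
    return output

# Sample 5x7 variable with 2x3 chunks; the element at (row, col) holds
# 100*row + col. Edge chunks are padded to full size and the padding is never read.
dimlen, chunklen = (5, 7), (2, 3)
chunks = {
    (ci, cj): [(ci * 2 + i) * 100 + (cj * 3 + j) for i in range(2) for j in range(3)]
    for ci, cj in itertools.product(range(3), range(3))
}
result = extract(chunks, dimlen, chunklen, [(1, 5, 2), (0, 7, 3)])
print(result)
```

<p>The example selects rows 1 and 3 and columns 0, 3, and 6, so the six output
values come from six different chunks. Each output position is computed per
element rather than from a single linear start offset, which keeps the sketch
short at the cost of speed.</p>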
<h2>Procedure compute_all_slice_projections</h2>
<pre><code>// Goal: create a vector of SliceIndex instances: one for each slice in the top-level input
SliceIndex[R]
compute_all_slice_projections(
    Slice slice[R] // the complete set of slices
    )
{
    int i;
    SliceIndex projections[R];
    for(i=0;i<R;i++) {
        projections[i] = compute_perslice_projections(dimlen[i],chunklen[i],slice[i]);
    }
    return projections;
}
</code></pre>
<h2>Procedure compute_perslice_projections</h2>
<h3>Goal:</h3>
<ul>
<li>For each slice, compute a set of projections from it with respect to a
dimension and a chunk size associated with that dimension.</li>
</ul>
<h3>Inputs:</h3>
<ul>
<li>dimlen -- dimension length</li>
<li>chunklen -- chunk length associated with the input dimension</li>
<li>slice=(start,stop,stride) -- associated with the dimension</li>
</ul>
<h3>Outputs:</h3>
<ul>
<li>Instance of SliceIndex</li>
</ul>
<h3>Data Structure:</h3>
<p>The result uses the SliceIndex and Projection types from the Global Type Declarations.</p>
<h3>Computations:</h3>
<ul>
<li>count = max(0, ceildiv((stop - start), stride))
<ul><li>total number of output items defined by this slice (equivalent to count as used by nc_get_vars)</li></ul></li>
<li>nchunks = ceildiv(dimlen, chunklen)
<ul><li>total number of chunks along this dimension</li></ul></li>
<li>chunk0 = floordiv(start,chunklen)
<ul><li>index (in 0..nchunks-1) of the first chunk touched by the slice</li></ul></li>
<li>chunkn = ceildiv(stop,chunklen)
<ul><li>one past the index of the last chunk touched by the slice (in 0..nchunks)</li></ul></li>
<li>n = (chunkn - chunk0)
<ul><li>total number of touched chunks</li>
<li>the index i will range over 0..(n-1)</li>
<li>this value equals nchunks only when the slice touches every chunk along the dimension; otherwise it is smaller</li></ul></li>
<li>For each touched chunk index we compute a projection specific to that chunk, hence
there are n of them.</li>
<li>projections.index[i] = chunk0 + i
<ul><li>index of the ith touched chunk along this dimension</li></ul></li>
<li>projections.offset[i] = projections.index[i] * chunklen
<ul><li>remember: offset is only WRT this dimension, not global</li></ul></li>
<li>projections.limit[i] = min(dimlen, (projections.index[i] + 1) * chunklen)
<ul><li>end of this chunk but no greater than dimlen</li></ul></li>
<li>projections.len[i] = projections.limit[i] - projections.offset[i]
<ul><li>actual length of the ith touched chunk; should be the same as chunklen except possibly for the last chunk because of the min function in computing limit[i]</li></ul></li>
<li>projections.start[i]:
<ul><li>This is somewhat complex because for the first projection, the start is the slice start,
but after that, we have to take into account that for a non-one stride, the start point
in a projection may be offset by some value in the range of 0..(stride-1)</li>
<li>i == 0 => projections.start[i] = start - projections.offset[i]
<ul><li>initial case: the original slice start is within the first projection</li></ul></li>
<li>i > 0 => projections.start[i] = 0
<ul><li>prevunused[i] = (projections.offset[i] - start) % stride
<ul><li>prevunused[i] is an intermediate computation and need not be saved</li>
<li>amount unused in previous chunk => we need to skip (stride-prevunused[i]) in this chunk</li></ul></li>
<li>prevunused[i] > 0 => projections.start[i] = stride - prevunused[i]</li></ul></li></ul></li>
<li>projections.stop[i]:
<ul><li>stop > projections.limit[i] => projections.stop[i] = projections.len[i]
<ul><li>selection continues past the current chunk</li></ul></li>
<li>stop <= projections.limit[i] => projections.stop[i] = stop - projections.offset[i]
<ul><li>selection ends within current chunk</li></ul></li></ul></li>
<li>projections.outpos[i]:
<ul><li>i == 0 => projections.outpos[i] = 0</li>
<li>i > 0 => projections.outpos[i] = ceildiv(projections.offset[i] - start, stride)
<ul><li>"location" in the output array to start storing items, which is the number of items selected from earlier chunks; again, per slice, not global</li></ul></li></ul></li>
</ul>
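<p>The computations above can be sketched for a single dimension in Python as
follows. This is an illustration under the assumptions 0 &lt;= start &lt;= stop
&lt;= dimlen and stride &gt;= 1, not the Zarr source. The names
<em>perslice_projections</em> and <em>selected_indices</em> and the flattened
<em>Projection</em> tuple are hypothetical.</p>

```python
from collections import namedtuple

# Flattened form of a projection: the per-chunk slice is stored inline.
Projection = namedtuple("Projection", "chunkindex start stop stride outpos")

def ceildiv(x, y):
    return -(-x // y)

def perslice_projections(dimlen, chunklen, start, stop, stride):
    result = []
    chunk0 = start // chunklen
    chunkn = ceildiv(stop, chunklen)  # one past the last touched chunk
    for chunkindex in range(chunk0, chunkn):
        offset = chunkindex * chunklen
        limit = min(dimlen, offset + chunklen)
        if start >= offset:
            # First touched chunk: the slice starts inside it.
            pstart, outpos = start - offset, 0
        else:
            # Later chunks: skip whatever is left of the current stride.
            prevunused = (offset - start) % stride
            pstart = stride - prevunused if prevunused else 0
            outpos = ceildiv(offset - start, stride)
        # Stop at the end of the chunk unless the slice ends inside it.
        pstop = (limit if stop > limit else stop) - offset
        result.append(Projection(chunkindex, pstart, pstop, stride, outpos))
    return result

def selected_indices(dimlen, chunklen, start, stop, stride):
    # Map each projection back to positions along the whole dimension.
    return [p.chunkindex * chunklen + k
            for p in perslice_projections(dimlen, chunklen, start, stop, stride)
            for k in range(p.start, p.stop, p.stride)]

projs = perslice_projections(10, 4, 1, 10, 3)
for p in projs:
    print(p)
print(selected_indices(10, 4, 1, 10, 3))
```

<p>With a dimension of length 10, chunks of length 4, and the slice (1,10,3),
the third chunk is touched by the slice range but contributes nothing: its
projection has start == stop, so the per-chunk range is empty. A projection
like this is harmless as long as the consumer iterates it as a range.</p>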
<h2>Procedure computelinearoffset</h2>
<pre><code>// Goal: Given a set of per-dimension indices, compute the corresponding linear position.
int
computelinearoffset(int R,
    int outpos[R],
    int dimlen[R]
    )
{
    int offset;
    int i;
    offset = 0;
    for(i=0;i<R;i++) {
        offset *= dimlen[i];
        offset += outpos[i];
    }
    return offset;
}
</code></pre>
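<p>The same row-major computation in Python, as a small sketch; the function
name and the sample shape are chosen only for illustration.</p>

```python
def compute_linear_offset(index, shape):
    # Horner-style accumulation: the last dimension varies fastest.
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset

print(compute_linear_offset((1, 2), (10, 20)))
```

<p>For the shape (10,20), index (1,2) lies one full row of 20 elements plus 2
positions from the start, which gives 22.</p>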
<h1>Appendix: Odometer Code</h1>
<pre><code>class Odometer
{
    int R; // rank
    int start[R];
    int stop[R];
    int stride[R];
    int index[R]; // current value of the odometer
    // Create an odometer with a stride of one in every position
    procedure new(int rank, int start0[R], int stop0[R])
    {
        int i;
        R = rank;
        for(i=0;i<R;i++) {
            start[i] = start0[i];
            stop[i] = stop0[i];
            stride[i] = 1;
            index[i] = start[i];
        }
    }
    // Create an odometer from a vector of slices
    procedure new(int rank, Slice slices[R])
    {
        int i;
        R = rank;
        for(i=0;i<R;i++) {
            start[i] = slices[i].start;
            stop[i] = slices[i].stop;
            stride[i] = slices[i].stride;
            index[i] = start[i];
        }
    }
    // True while there are more positions to visit
    boolean
    procedure more(void)
    {
        return (index[0] < stop[0]);
    }
    // Advance to the next position; the last wheel moves fastest
    procedure next(void)
    {
        int i;
        for(i=R-1;i>=0;i--) {
            index[i] += stride[i];
            if(index[i] < stop[i]) break;
            if(i == 0) break; // leave the 0th entry if it overflows
            index[i] = start[i]; // reset this position
        }
    }
    // Get the value of the odometer
    int[R]
    procedure indices(void)
    {
        return index;
    }
}
</code></pre>
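<p>A direct Python transcription of the odometer might look like the sketch
below. It assumes a rank of at least 1 and that every dimension is non-empty
(start &lt; stop), since more() only inspects the leftmost wheel. The helper
<em>walk</em> is hypothetical and exists only to collect the visited positions.</p>

```python
class Odometer:
    def __init__(self, start, stop, stride=None):
        self.start = list(start)
        self.stop = list(stop)
        # Default to unit stride, as in the two-vector constructor.
        self.stride = list(stride) if stride is not None else [1] * len(self.start)
        self.index = list(self.start)

    def more(self):
        # Only the leftmost wheel is allowed to run past its stop.
        return self.index[0] < self.stop[0]

    def next(self):
        for i in reversed(range(len(self.index))):
            self.index[i] += self.stride[i]
            if self.index[i] < self.stop[i] or i == 0:
                break
            self.index[i] = self.start[i]  # wrap this wheel, carry left

    def indices(self):
        return tuple(self.index)

def walk(odometer):
    visited = []
    while odometer.more():
        visited.append(odometer.indices())
        odometer.next()
    return visited

print(walk(Odometer([0, 1], [2, 6], [1, 2])))
```

<p>The visiting order matches <em>itertools.product(range(0,2), range(1,6,2))</em>,
which is one way to check an odometer implementation against a known-good
iterator.</p>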
https://www.unidata.ucar.edu/blogs/developer/entry/metpy-mondays-80-type-hintingMetPy Mondays #80 - Type HintingJohn Leeman2019-05-20T07:16:14-06:002019-05-20T07:16:14-06:00<p>This week we explore type hinting and how it can make your code more reader friendly.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/aC9MA7lvuOo" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
https://www.unidata.ucar.edu/blogs/developer/entry/metpy-mondays-79-function-documentationMetPy Mondays #79 - Function DocumentationJohn Leeman2019-05-14T20:47:09-06:002019-05-14T20:47:09-06:00<p>Documenting isn't always fun, but it's the best way to be nice to your future self! Find out how in this week's MetPy Monday!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/zr9MDnib4yU" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
https://www.unidata.ucar.edu/blogs/developer/entry/metpy-mondays-78-path-bestMetPy Mondays #78 - Path Best PracticesJohn Leeman2019-05-06T20:48:22-06:002019-05-06T20:48:22-06:00<iframe width="560" height="315" src="https://www.youtube.com/embed/dbKJ7nIv878" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>