[netcdf-java] ucar.nc2.FileWriter bug fix: copySome actually copies everything at once

Hello all,

In NJ 4.2.20100409.0054, the method ucar.nc2.FileWriter.copySome() is supposed to copy data for a large variable in a series of small chunks. As written, however, it actually attempts to copy everything at once. (Here's hoping that Thunderbird preserves the whitespace in my preformatted code samples, or else this post will be very tough to follow.)

  private static void copySome(NetcdfFileWriteable ncfile, Variable oldVar, int nelems) throws IOException {
    String newName = N3iosp.makeValidNetcdfObjectName( oldVar.getName());

    int[] shape = oldVar.getShape();
    int[] origin = new int[oldVar.getRank()];
    int size = shape[0];

    for (int i = 0; i < size; i += nelems) {
      origin[0] = i;
      int left = size - i;
      shape[0] = Math.min(nelems, left);

      Array data;
      try {
        data = oldVar.read(origin, shape);
        ...


I'm not exactly sure what the intended logic was, but it's clear that in the first iteration of the loop, origin will be all zeroes (e.g. {0, 0, 0} if the rank of oldVar is 3), and that for any large variable nelems (computed by the caller as size / maxSize) exceeds shape[0], so shape[0] = Math.min(nelems, left) leaves shape identical to oldVar's shape. Therefore, the code will attempt to read all of oldVar's data at once, and an OutOfMemoryError will result if oldVar is large.
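To see this concretely, here is a standalone sketch (plain Java, no NetCDF dependencies; the class name, variable shape, and nelems value are hypothetical, chosen to mimic a large variable) that simulates what the first iteration of the loop above requests:

```java
// Simulates the first iteration of the original copySome() loop.
// When nelems >= shape[0] -- which happens whenever the caller's
// nelems = size / maxSize exceeds the outer dimension length --
// the very first read requests the ENTIRE variable.
public class BrokenChunkDemo {
    // Number of elements the first oldVar.read(origin, shape) would request.
    public static long firstChunkSize(int[] varShape, int nelems) {
        int[] shape = varShape.clone();
        int size = shape[0];
        shape[0] = Math.min(nelems, size);  // first iteration: i = 0, left = size
        long total = 1;
        for (int d : shape) total *= d;
        return total;
    }

    public static void main(String[] args) {
        int[] shape = {8, 1560, 128, 256};   // hypothetical large variable
        int nelems = 1636;                   // hypothetical size / maxSize
        long varSize = 8L * 1560 * 128 * 256;
        long firstRead = firstChunkSize(shape, nelems);
        // firstRead == varSize: the first read spans the whole variable
        System.out.println("first read requests " + firstRead
                + " of " + varSize + " elements");
    }
}
```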

On 03/29/2010, Robert Bridle proposed some new code for FileWriter.copyVarData() (which calls copySome()) that would split the write job into chunks:

///////////// ORIGINAL CODE //////////////////////
      int nelems = (int) (size / maxSize);
      if (nelems <= 1)
        copyAll(ncfile, oldVar);
      else
        copySome(ncfile, oldVar, nelems);
////////////////////////////////////////////////////

////////////// PROPOSED CODE //////////////////////
/*      if (size > maxSize)
      {
        int[] shape = oldVar.getShape();

        // determine the size of all the dimensions, other than the first.
        long sizeOfOtherDimensions = 1;
        for (int i = 1; i < shape.length; i++) {
          if (shape[i] >= 0)
            sizeOfOtherDimensions *= shape[i];
        }

        // determine number of bytes in all the dimensions, other than the first.
        long bytesInOtherDimensions = sizeOfOtherDimensions * oldVar.getElementSize();

        // first dimension chunk-size that will fit within maxSize of memory.
        int firstDimensionChunkSize = (int) (maxSize/bytesInOtherDimensions);
        //System.out.println("We can fit: " + firstDimensionChunkSize + " chunks in: " + maxSize + " bytes of memory.");

        copySome(ncfile, oldVar, firstDimensionChunkSize);
      }
      else
      {
        copyAll(ncfile, oldVar);
      }    */
////////////////////////////////////////////////////


This will write the data in N chunks where N is the size of the outer-most dimension. But what about when the stride of the outer dimension is very large? For example, there's a variable from a massive aggregated dataset I'm working with that has the CDL:

   float pr(ensemble=8, time=1560, lat=128, lon=256);


which means an outer-most dimension stride of 1560*128*256 = 51,118,080 elements. Using 32-bit floats, a single slab of that outer dimension would require 195 MB to store, which is far larger than the maxSize of 1 MB.

So, I propose a different algorithm:

    /**
     * An index that computes chunk shapes. It is intended to be used to compute the origins and shapes for a series
     * of contiguous writes to a multidimensional array.
     */
    public static class ChunkingIndex extends Index {
        public ChunkingIndex(int[] shape) {
            super(shape);
        }

        /**
         * Computes the shape of the largest possible <b>contiguous</b> chunk, starting at {@link #getCurrentCounter()}
         * and with {@code size <= maxChunkSize}.
         *
         * @param maxChunkSize  the maximum size of the chunk shape. The actual size of the shape returned is likely
         *                      to be different, and can be found with {@link Index#computeSize}.
         * @return  the shape of the largest possible contiguous chunk.
         */
        public int[] computeChunkShape(int maxChunkSize) {
            int[] chunkShape = new int[rank];

            for (int iDim = 0; iDim < rank; ++iDim) {
                chunkShape[iDim] = maxChunkSize / stride[iDim];
                chunkShape[iDim] = (chunkShape[iDim] == 0) ? 1 : chunkShape[iDim];
                chunkShape[iDim] = Math.min(chunkShape[iDim], shape[iDim] - current[iDim]);
            }

            return chunkShape;
        }
    }

    private static void copySome(NetcdfFileWriteable ncfile, Variable oldVar, int nelems) throws IOException {
        String newName = N3iosp.makeValidNetcdfObjectName(oldVar.getName());

        ChunkingIndex index = new ChunkingIndex(oldVar.getShape());
        while (index.currentElement() < index.getSize()) {
            try {
                int[] chunkOrigin = index.getCurrentCounter();
                int[] chunkShape  = index.computeChunkShape(nelems);
                Array data = oldVar.read(chunkOrigin, chunkShape);

                if (oldVar.getDataType() == DataType.STRING) {
                    data = convertToChar(ncfile.findVariable(newName), data);
                }

                if (data.getSize() > 0) { // zero when record dimension = 0
                    ncfile.write(newName, chunkOrigin, data);
                    if (debugWrite) {
                        System.out.println("write " + data.getSize() + " bytes");
                    }
                }

                index.setCurrentCounter(index.currentElement() + (int) Index.computeSize(chunkShape));
            } catch (InvalidRangeException e) {
                e.printStackTrace();
                throw new IOException(e.getMessage());
            }
        }
    }


This will result in chunks that are *never* larger than nelems elements, regardless of oldVar's size or shape. For example, if oldVar.getShape() == { 5, 16, 8 } and nelems = 100, the origins and shapes of the chunked read/writes will be:

     origin      shape       size
r/w: [0, 0, 0] , [1, 12, 8], 96
r/w: [0, 12, 0], [1, 4, 8] , 32
r/w: [1, 0, 0] , [1, 12, 8], 96
r/w: [1, 12, 0], [1, 4, 8] , 32
r/w: [2, 0, 0] , [1, 12, 8], 96
r/w: [2, 12, 0], [1, 4, 8] , 32
r/w: [3, 0, 0] , [1, 12, 8], 96
r/w: [3, 12, 0], [1, 4, 8] , 32
r/w: [4, 0, 0] , [1, 12, 8], 96
r/w: [4, 12, 0], [1, 4, 8] , 32


As you can see, none of the chunks is exactly 100 elements in size, but given the constraints of the NetCDF API, I don't think that can be helped: we'd need to be able to read and write 1-D Arrays of values from/to a specific offset in the 1-D backing array.
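For anyone who wants to check the table above without pulling in the library, here is a dependency-free sketch of the same chunk walk. It re-derives the row-major stride/counter arithmetic that ucar.ma2.Index provides (the class and helper names are mine, not part of the NetCDF-Java API):

```java
import java.util.Arrays;

// Reproduces the origin/shape table for shape {5, 16, 8}, nelems = 100,
// using the same logic as the proposed computeChunkShape().
public class ChunkWalkDemo {
    // Row-major strides: stride[i] = product of shape[i+1..].
    static int[] strides(int[] shape) {
        int[] stride = new int[shape.length];
        int s = 1;
        for (int i = shape.length - 1; i >= 0; i--) { stride[i] = s; s *= shape[i]; }
        return stride;
    }

    // Converts a flat element offset into a multidimensional counter.
    static int[] counterOf(long element, int[] shape, int[] stride) {
        int[] counter = new int[shape.length];
        for (int i = 0; i < shape.length; i++) {
            counter[i] = (int) (element / stride[i]);
            element %= stride[i];
        }
        return counter;
    }

    static long sizeOf(int[] shape) {
        long n = 1;
        for (int d : shape) n *= d;
        return n;
    }

    // Same per-dimension logic as the proposed computeChunkShape().
    static int[] chunkShape(int maxChunkSize, int[] shape, int[] stride, int[] current) {
        int[] chunk = new int[shape.length];
        for (int i = 0; i < shape.length; i++) {
            chunk[i] = maxChunkSize / stride[i];
            if (chunk[i] == 0) chunk[i] = 1;
            chunk[i] = Math.min(chunk[i], shape[i] - current[i]);
        }
        return chunk;
    }

    public static void main(String[] args) {
        int[] shape = {5, 16, 8};
        int nelems = 100;
        int[] stride = strides(shape);
        long element = 0, total = sizeOf(shape);
        while (element < total) {
            int[] origin = counterOf(element, shape, stride);
            int[] chunk = chunkShape(nelems, shape, stride, origin);
            System.out.println(Arrays.toString(origin) + " " + Arrays.toString(chunk)
                    + " " + sizeOf(chunk));
            element += sizeOf(chunk);
        }
    }
}
```

Running it prints the same ten origin/shape pairs as the table, alternating chunk sizes of 96 and 32.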

If you're interested, I've attached a patch containing the changes.

Regards,
Christian Ward-Garrison
--- C:/Documents and Settings/cwardgar/Desktop/origFileWriter.java      Tue Apr  6 12:20:34 2010
+++ C:/Documents and Settings/cwardgar/Desktop/newFileWriter.java       Fri Apr  9 03:16:37 2010
@@ -344,35 +344,64 @@
     }
   }
 
-  private static void copySome(NetcdfFileWriteable ncfile, Variable oldVar, int nelems) throws IOException {
-    String newName = N3iosp.makeValidNetcdfObjectName( oldVar.getName());
+    /**
+     * An index that computes chunk shapes. It is intended to be used to compute the origins and shapes for a series
+     * of contiguous writes to a multidimensional array.
+     */
+    public static class ChunkingIndex extends Index {
+        public ChunkingIndex(int[] shape) {
+            super(shape);
+        }
 
-    int[] shape = oldVar.getShape();
-    int[] origin = new int[oldVar.getRank()];
-    int size = shape[0];
+        /**
+         * Computes the shape of the largest possible <b>contiguous</b> chunk, starting at {@link #getCurrentCounter()}
+         * and with {@code size <= maxChunkSize}.
+         *
+         * @param maxChunkSize  the maximum size of the chunk shape. The actual size of the shape returned is likely
+         *                      to be different, and can be found with {@link Index#computeSize}.
+         * @return  the shape of the largest possible contiguous chunk.
+         */
+        public int[] computeChunkShape(int maxChunkSize) {
+            int[] chunkShape = new int[rank];
 
-    for (int i = 0; i < size; i += nelems) {
-      origin[0] = i;
-      int left = size - i;
-      shape[0] = Math.min(nelems, left);
+            for (int iDim = 0; iDim < rank; ++iDim) {
+                chunkShape[iDim] = maxChunkSize / stride[iDim];
+                chunkShape[iDim] = (chunkShape[iDim] == 0) ? 1 : chunkShape[iDim];
+                chunkShape[iDim] = Math.min(chunkShape[iDim], shape[iDim] - current[iDim]);
+            }
 
-      Array data;
-      try {
-        data = oldVar.read(origin, shape);
-        if (oldVar.getDataType() == DataType.STRING) {
-          data = convertToChar(ncfile.findVariable(newName), data);
+            return chunkShape;
         }
-        if (data.getSize() > 0)  {// zero when record dimension = 0
-          ncfile.write(newName, origin, data);
-          if (debugWrite) System.out.println("write "+data.getSize()+" bytes");
+    }
+
+    private static void copySome(NetcdfFileWriteable ncfile, Variable oldVar, int nelems) throws IOException {
+        String newName = N3iosp.makeValidNetcdfObjectName(oldVar.getName());
+
+        ChunkingIndex index = new ChunkingIndex(oldVar.getShape());
+        while (index.currentElement() < index.getSize()) {
+            try {
+                int[] chunkOrigin = index.getCurrentCounter();
+                int[] chunkShape  = index.computeChunkShape(nelems);
+                Array data = oldVar.read(chunkOrigin, chunkShape);
+
+                if (oldVar.getDataType() == DataType.STRING) {
+                    data = convertToChar(ncfile.findVariable(newName), data);
+                }
+
+                if (data.getSize() > 0) {// zero when record dimension = 0
+                    ncfile.write(newName, chunkOrigin, data);
+                    if (debugWrite) {
+                        System.out.println("write " + data.getSize() + " bytes");
+                    }
+                }
+
+                index.setCurrentCounter(index.currentElement() + (int) Index.computeSize(chunkShape));
+            } catch (InvalidRangeException e) {
+                e.printStackTrace();
+                throw new IOException(e.getMessage());
+            }
         }
-
-      } catch (InvalidRangeException e) {
-        e.printStackTrace();
-        throw new IOException(e.getMessage());
-      }
     }
-  }
 
   private static Array convertToChar(Variable newVar, Array oldData) {
     ArrayChar newData = (ArrayChar) Array.factory(DataType.CHAR, newVar.getShape());