
Why I Don't Think the Number of Processors Affects the Wall Clock Time of My Tests...

Using the taskset command I demonstrate that my benchmarks run on one processor.

Recently Russ raised a question: are my wall clock times wrong because I have many processors on my machine? I believe the answer is no.

First, without special effort, Linux will not spread a single-threaded process across more than one processor. When I run my benchmarking program, tst_ar4, the top command shows one processor going to > 90% use while all the others remain at 0%.

Second, I confirmed this with the taskset command, which I had never heard of before. It limits a process to one (or any other number) of processors; in fact, it lets you pick which processors are used. Here are some timing results showing that I get about the same times with and without taskset, on a compressed file read:

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h -c 35000000 \
bash-3.2$ pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       304398    4568

bash-3.2$ sudo ./clear_cache.sh && taskset -c 4 ./tst_ar4 -h -c 35000000 \
bash-3.2$ pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       306810   4553

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h -c 35000000 \
bash-3.2$ pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       292707   4616

bash-3.2$ sudo ./clear_cache.sh && taskset -c 4 ./tst_ar4 -h -c 35000000 \
bash-3.2$ pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2]  cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
64    128   256    33.4      1       0       293713   4567

Large-Enough Cache Very Important When Reading Compressed NetCDF-4/HDF5 Data

The HDF5 chunk cache must be large enough to hold an uncompressed chunk.

Here are some test runs showing that a large enough cache is very important when reading compressed data. If the chunk cache is not big enough, then the data have to be uncompressed again and again.

The first run below uses the default 1 MB chunk cache. The second uses a 16 MB cache. Note that the times to read the first time step are comparable, but the run with the large cache has a much lower average time, because each chunk is only uncompressed once.

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 pr_A1_z1_64_128_256.nc -h
cs[0] cs[1] cs[2] cache(MB) deflate shuffle 1st_read_hor(us)   avg_read_hor(us)
64    128   256   1.0       1       0       387147             211280

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h -c 16000000 \
bash-3.2$ pr_A1_z1_64_128_256.nc
cs[0] cs[1] cs[2] cache(MB)  deflate shuffle 1st_read_hor(us)   avg_read_hor(us)
64   128   256   15.3       1       0       320176             4558

For comparison, here's the time for the netCDF-4/HDF5 file which is not compressed:

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h pr_A1_64_128_256.nc
cs[0] cs[1] cs[2] cache(MB)  deflate shuffle 1st_read_hor(us)  avg_read_hor(us)
64    128   256   1.0        0       0       459               1466

And here's the same run on the classic netCDF version of the file:

bash-3.2$ sudo ./clear_cache.sh && ./tst_ar4 -h \
bash-3.2$ pr_A1.20C3M_8.CCSM.atmm.1870-01_cat_1999-12.nc
cs[0] cs[1] cs[2] cache(MB)  deflate shuffle 1st_read_hor(us)  avg_read_hor(us)
0     0     0     0.0        0       0       2172              1538

So the winner for performance is netCDF-4/HDF5, with the best read time for the first time step and the best average read time. Next comes the netCDF classic file, then the compressed netCDF-4/HDF5 file, which takes two orders of magnitude longer than the classic file for the first time step, but then catches up so that its average read time is only about three times slower than the classic file's.

The file sizes show that this read penalty is probably not worth it:

pr_A1.20C3M_8.CCSM.atmm.1870-01_cat_1999-12.nc    204523236
pr_A1_z1_64_128_256.nc                            185543248
pr_A1_64_128_256.nc                               209926962

So the compressed NetCDF-4/HDF5 file saves only 20 MB out of about 200, about 10%.

The uncompressed NetCDF-4/HDF5 file is 5 MB larger than the classic file, or about 2.5% larger. 

The Point of All These Tests

It's all about finding a good set of default chunk sizes for netCDF-4.1.

Tests seem to be indicating that, for the 3D data, a chunk size of 32 or 64 for the unlimited dimension provides a good trade-off in performance for time series and time step reads, without inflating the file size too much.

This makes intuitive sense as well. Larger chunk sizes mean that any leftover chunks (i.e. chunks that are only partially filled with data) take up more space on disk and make the file bigger.

Here are some numbers from the latest tests. The top test is the netCDF classic format case. These are the time step reads:

cs[0] cs[1] cs[2] cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
0     0     0     0.0       0       0       35974            3125
32    64    128   1.0       0       0       261893           2931
32    64    256   1.0       0       0       132380           3563
32    128   128   1.0       0       0       151692           3657
32    128   256   1.0       0       0       8063             2219
64    64    128   1.0       0       0       133339           4264
64    64    256   1.0       0       0       28208            3359
64    128   128   1.0       0       0       27536            3051
64    128   256   1.0       0       0       110620           2043

Here are the time series reads:

cs[0] cs[1] cs[2] cache(MB) deflate shuffle 1st_read_ser(us) avg_read_ser(us)
0     0     0     0.0       0       0       3257952          8795
32    64    128   1.0       0       0       1427863          15069
32    64    256   1.0       0       0       2219838          4394
32    128   128   1.0       0       0       2054724          4668
32    128   256   1.0       0       0       3335330          4347
64    64    128   1.0       0       0       1041324          3581
64    64    256   1.0       0       0       1893643          2995
64    128   128   1.0       0       0       1942810          3024
64    128   256   1.0       0       0       3210923          3975

For the time series test, we see that smaller chunk sizes for the horizontal dimensions work better, and larger chunk sizes for the time dimension work better.

For the horizontal read we see that larger chunk sizes for the horizontal dimensions work better, and smaller chunk sizes along the time dimension work better.

Maybe the answer *is* to go with the current default scheme, but just make the sizes of the chunks that it writes much bigger.

I would really like 64 x 64 x 128 for the data above, except for the (possibly spurious) high value for the first horizontal read in that case.

Narrowing In On Correct Chunksizes For the 3D AR-4 Data

We're getting there...

It seems clear that Quincey's original advice is good: use large, squarish chunks.

My former scheme of default chunk sizes did not work badly for the innermost dimensions (it used the full length of the dimension), but using a chunksize of 1 for unlimited dimensions was bad for read performance.

Here are some read numbers for what I believe is the correct range of chunksizes:

cs[0] cs[1] cs[2] cache(MB) deflate shuffle 1st_read_hor(us) avg_read_hor(us)
0     0     0     0.0       0       0       7087             1670
64    128   256   1.0       0       0       510              1549
128   128   256   1.0       0       0       401              1688
256   128   256   1.0       0       0       384              1679
64    128   256   1.0       1       0       330548           211382
128   128   256   1.0       1       0       618035           420617

Note that the last two runs are deflated versions of the data, and are two to three orders of magnitude slower to read as a result.

The first line is the netCDF classic file. The non-deflated HDF5 files easily beat the read performance of the classic file, probably because the HDF5 files are in native endianness and the netCDF classic file has to be converted from big-endian to little-endian for this platform.

What is odd is that the HDF5 files have a higher average read time than their first read time. I don't get that. I expected that the first read would always be the longest wait, but once you started, subsequent reads would be faster. But not for these uncompressed HDF5 files. I am clearing the cache between each read.

Here's my timing code:

    /* Read the data variable in horizontal slices. */
    start[0] = 0;
    start[1] = 0;
    start[2] = 0;
    count[0] = 1;
    count[1] = LAT_LEN;
    count[2] = LON_LEN;

    /* Read (and time) the first one. */
    if (gettimeofday(&start_time, NULL)) ERR;
    if (nc_get_vara_float(ncid, varid, start, count, hor_data)) ERR_RET;
    if (gettimeofday(&end_time, NULL)) ERR;
    if (timeval_subtract(&diff_time, &end_time, &start_time)) ERR;
    read_1_us = (int)diff_time.tv_sec * MILLION + (int)diff_time.tv_usec;

    /* Read (and time) all the rest. */
    if (gettimeofday(&start_time, NULL)) ERR;
    for (start[0] = 1; start[0] < TIME_LEN; start[0]++)
       if (nc_get_vara_float(ncid, varid, start, count, hor_data)) ERR_RET;
    if (gettimeofday(&end_time, NULL)) ERR;
    if (timeval_subtract(&diff_time, &end_time, &start_time)) ERR;
    avg_read_us = ((int)diff_time.tv_sec * MILLION + (int)diff_time.tv_usec +
                   read_1_us) / TIME_LEN; 
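The timeval_subtract() call above isn't shown; a common implementation (adapted from the example in the glibc manual, so a sketch rather than necessarily the exact code used here) is:

```c
#include <sys/time.h>

/* Subtract y from x, storing the result in result.
 * Return 1 if the difference is negative, otherwise 0. */
int timeval_subtract(struct timeval *result, struct timeval *x,
                     struct timeval *y)
{
    /* Perform the carry for the later subtraction by updating y. */
    if (x->tv_usec < y->tv_usec) {
        int nsec = (y->tv_usec - x->tv_usec) / 1000000 + 1;
        y->tv_usec -= 1000000 * nsec;
        y->tv_sec += nsec;
    }
    if (x->tv_usec - y->tv_usec > 1000000) {
        int nsec = (x->tv_usec - y->tv_usec) / 1000000;
        y->tv_usec += 1000000 * nsec;
        y->tv_sec -= nsec;
    }

    /* Compute the remaining time; tv_usec is certainly positive. */
    result->tv_sec = x->tv_sec - y->tv_sec;
    result->tv_usec = x->tv_usec - y->tv_usec;

    return x->tv_sec < y->tv_sec;
}
```

Note that the average computation above folds read_1_us back in before dividing by TIME_LEN, so avg_read_hor(us) is the mean over all TIME_LEN reads, including the first.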

File Size and Chunking in NetCDF-4 on AR-4 Data File

Trying to pick chunksizes can be hard! Here are the file-size differences, relative to the classic file, for a range of chunkings (time_lat_lon):

chunk sizes      Size Difference (MB)
1_16_32          5.75
1_16_128         1.56
1_16_256         0.86
1_64_32          1.56
1_64_128         0.51
1_64_256         0.33
1_128_32         0.86
1_128_128        0.33
1_128_256        0.25
10_16_32         0.72
10_16_128        0.30
10_16_256        0.23
10_64_32         0.30
10_64_128        0.20
10_64_256        0.18
10_128_32        0.23
10_128_128       0.18
10_128_256       0.17
256_16_32        30.59
256_16_128       30.58
256_16_256       30.57
256_64_32        30.58
256_64_128       30.57
256_64_256       30.57
256_128_32       30.57
256_128_128      30.57
256_128_256      30.57
1024_16_32       64.13
1024_16_128      64.12
1024_16_256      64.12
1024_64_32       64.12
1024_64_128      64.12
1024_64_256      64.12
1024_128_32      64.12
1024_128_128     64.12
1024_128_256     64.12
1560_16_32       0.16
1560_16_128      0.16
1560_16_256      0.16
1560_64_32       0.16
1560_64_128      0.16
1560_64_256      0.16
1560_128_32      0.16
1560_128_128     0.16
1560_128_256     0.16
classic          0
Unidata Developer's Blog
A weblog about software development by Unidata developers