Let's take a deep dive into how to store floating point data by specifying a desired precision and throwing away the extra bits of precision. These extra bits tend to be quite random and therefore incompressible, and so they limit how well standard compression algorithms can do. This is a lossy compression technique.
One method is to choose a desired precision and use it to convert the floating point values into integers using a scale and offset. That looks something like this:
Given a floating point data array and a floating point precision, find floating point scale and offset and minimum integer n such that for each value F in the data array:
UF = scale * P + offset

where:
  F  is the original floating point number
  P  is an integer of n bits (the packed value)
  UF is the reconstructed floating point number (the unpacked value)

and:
  abs(F - UF) <= precision
Here we are using an absolute precision, expressed in the units of the data array, for example 0.25 K.
Here's how to implement this. Given a data array and a precision, find the minimum and maximum values in the data array, then:
nvalues = 1 + Math.ceil((dataMax - dataMin) / (2 * precision))
n       = Math.ceil(log2(nvalues))
offset  = dataMin
scale   = (dataMax - dataMin) / (2^n - 1)
So let's reason this through. Imagine that you are looking at the floating point numbers between dataMin and dataMax, along the real number axis. Starting at dataMin, make a mark along the axis every 2 * precision interval. Call this interval the resolution of the data. So if the range of the data was 0 to 10 K, and precision was .25 K, you would make a mark every .5 K, and you would have 21 marks, as the formula for nvalues shows. Now imagine picking any number F between dataMin and dataMax. There will always be a mark within one half interval (.25 K) of F; if UF is that nearest mark, then abs(F - UF) <= precision, as required.
So now we know we need nvalues to represent the data within the desired precision, and the number of bits we need to represent nvalues is just log base 2 of nvalues; in our example log2(21) = ln(21)/ln(2) = 4.39, and we round up to 5. Since 2^5 = 32, a 5-bit integer can represent up to 32 distinct values, more than the 21 we need.
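For concreteness, the arithmetic above can be sketched in a few lines of Python (the names follow the text; this is just an illustration, not code from any GRIB library):

```python
import math

dataMin, dataMax = 0.0, 10.0   # data range from the example, in K
precision = 0.25               # desired absolute precision

# one mark every 2 * precision, plus one for the starting mark at dataMin
nvalues = 1 + math.ceil((dataMax - dataMin) / (2 * precision))

# bits needed to represent nvalues distinct values
n = math.ceil(math.log2(nvalues))

print(nvalues, n)  # 21 5
```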
If we stick with the way we have constructed our marks every .5 K along the interval from 0 to 10, and imagine that these marks are the possible values of UF, then:
offset = dataMin
scale  = 2 * precision
P      = round((F - offset) / scale)    (A)
Using this formula for P, when F = dataMin, P = 0, and when F = dataMax, P = round((dataMax - dataMin) / (2 * precision)), which in our example is 20. Any other F lies between dataMin and dataMax, so every P will be one of the 21 values between 0 and 20 inclusive.
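A minimal sketch of packing and unpacking with formula (A), using the example's range and precision (the function names pack and unpack are my own, for illustration):

```python
def pack(F, dataMin, precision):
    """Pack one value using formula (A): offset = dataMin, scale = 2 * precision."""
    offset = dataMin
    scale = 2 * precision
    return round((F - offset) / scale)

def unpack(P, dataMin, precision):
    """Reconstruct the unpacked value: UF = scale * P + offset."""
    return 2 * precision * P + dataMin

dataMin, dataMax, precision = 0.0, 10.0, 0.25
for F in [0.0, 0.3, 4.99, 7.13, 10.0]:
    P = pack(F, dataMin, precision)
    UF = unpack(P, dataMin, precision)
    assert 0 <= P <= 20                 # all packed values fall in 0..20
    assert abs(F - UF) <= precision     # error bound holds for every F
```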
We could leave it there, but there is a possible improvement. In our example, nvalues is not a power of 2: we only need 21 values, but since we can only use an integral number of bits, we use a 5-bit number, which can represent 32 values. So we are not using (32 - 21)/32 = 34% of the possible packed values.
We can't do anything about the need for an integral number of bits, but we can use all of the bits by decreasing the resolution interval (and so increasing the precision of the stored data). What we want is to map dataMin to P = 0, as before, but map dataMax to the maximum P, which for n bits is 2^n - 1. The way to do that is to set
scale = (dataMax - dataMin) / (2^n - 1)
then when F = dataMax
P = round((F - offset) / scale)
P = round((dataMax - dataMin) / scale)
P = round((dataMax - dataMin) * (2^n - 1) / (dataMax - dataMin))
P = 2^n - 1
A variation of this is to reserve one of the values, say the top value (2^n - 1), to mean missing; in that case we want to map dataMax to (2^n - 2), so:
scale = (dataMax - dataMin) / (2^n - 2)    /* reserve one value for missing */

(Note that when we reserve one of the values for missing, we have to increase nvalues by one, which may bump n, the number of bits needed to represent the nvalues.)
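Here is one possible sketch of the full-range scaling with a reserved missing value (the function compute_packing is a name I made up; it is not from any library):

```python
import math

def compute_packing(dataMin, dataMax, precision, reserve_missing=True):
    """Find n, scale, offset; optionally reserve the top packed value for missing."""
    nvalues = 1 + math.ceil((dataMax - dataMin) / (2 * precision))
    if reserve_missing:
        nvalues += 1                      # one extra code point for "missing"
    n = math.ceil(math.log2(nvalues))     # the extra code point may bump n
    top = (2**n - 2) if reserve_missing else (2**n - 1)
    scale = (dataMax - dataMin) / top     # use the full range of packed integers
    return n, scale, dataMin

n, scale, offset = compute_packing(0.0, 10.0, 0.25)
missing = 2**n - 1                        # the reserved code: 31 here
P_max = round((10.0 - offset) / scale)    # dataMax maps to 2^n - 2 = 30
```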
So what's the resolution of the data? Remember that
UF = scale * P + offset
When P increases by 1, UF increases by scale. So scale is the resolution, and as we showed before, the precision is resolution / 2. So with our new scaling, the precision is
precision = scale/2 = (dataMax - dataMin) / (2 * (2^n - 2))
We can show that this precision is always less than or equal to the originally specified precision. By increasing the number of points we use to a power of 2, we make the resolution interval smaller and get better precision. What remains to be seen is whether compression suffers from using this expanded precision.
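This claim can be spot-checked numerically. The sketch below tries many random ranges and precisions and verifies that the full-range precision never exceeds the requested one (using the 2^n - 1 scaling, with no reserved missing value):

```python
import math
import random

random.seed(42)
for _ in range(1000):
    dataMin = random.uniform(-100.0, 100.0)
    dataMax = dataMin + random.uniform(0.001, 1000.0)   # random data range
    precision = random.uniform(1e-4, 10.0)              # requested precision

    nvalues = 1 + math.ceil((dataMax - dataMin) / (2 * precision))
    n = math.ceil(math.log2(nvalues))

    # precision of the full-range scaling: scale / 2
    new_precision = (dataMax - dataMin) / (2 * (2**n - 1))

    # 2^n - 1 >= nvalues - 1 >= range / (2 * precision), so this always holds
    assert new_precision <= precision * (1 + 1e-12)
```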
The GRIB spec (both GRIB-1 and GRIB-2) uses the following packing algorithm for simple and complex bit packing and for JPEG2000 compression, i.e. all the NCEP model data in our sample:
Y * 10^D = R + X * 2^E    (1)

where:
  Y = original floating point value
  X = stored n-bit integer (packed value)
  R = reference value
  D = decimal scale factor
  E = binary scale factor
Putting this into our equation (A):
Y = R / 10^D + X * 2^E / 10^D    (2)
Y = offset + X * scale           (A)

so:
  scale     = 2^E / 10^D
  offset    = R / 10^D
  precision = scale / 2 = 2^(E-1) / 10^D
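A small sketch of unpacking per GRIB formula (1); the particular values of R, D, and E below are made up for illustration:

```python
def grib_unpack(X, R, D, E):
    """Unpack one value per the GRIB formula: Y * 10^D = R + X * 2^E."""
    scale = 2.0**E / 10.0**D    # scale  = 2^E / 10^D
    offset = R / 10.0**D        # offset = R / 10^D
    return offset + X * scale

# example: D = 1, E = -2  =>  scale = 2^-2 / 10^1 = 0.025
Y = grib_unpack(X=40, R=0.0, D=1, E=-2)   # 40 * 0.025 = 1.0
```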
What's notable here is that the GRIB spec constrains the precision to be a power of 2 divided by a power of 10. There is no obvious reason for such a constraint, although we do tend to think in powers of 2 and 10. (Examining our NCEP GRIB1 model data, only 215 out of 24933 records use both a nonzero binary scaling and a nonzero decimal scaling.)
I have often wondered how the GRIB producers set compression parameters such as nbits. Now I'm pretty sure I know, at least for NCEP, and it's pretty much what you would expect. Each individual variable has a precision assigned to it, using the decimal and/or binary scale factor exponents. Using the above algorithm, when a record is written, the data range is found and nbits is calculated. However, scale is not set to use the full range of the packed integers. In fact there is no way to do so, since the scale can only be specified as a power of 2 divided by a power of 10. Since the number of bits needed depends on both the precision and the data range (min, max), most of the time nvalues will not be a power of 2, and some of the range of the packed integers will not be used. Examining the actual NCEP model data confirms all of this.
If you would like to look at the data packing in GRIB files, use the latest 4.5.3 version of ToolsUI: go to the IOSP/GRIB1/GRIB1data or IOSP/GRIB2/GRIB2data tab, enter your GRIB dataset, and examine the tables, e.g.:
In summary, we have discovered that GRIB uses a suboptimal way to specify the precision of the packed data:
- precision must be specified as a power of 2 divided by a power of 10, instead of an arbitrary float.
- the scaling factor is not set to use the full range of the packed numbers, which would allow more precision in the same number of bits.
Yet to be discovered is whether using the full range of packed numbers makes compressing those numbers worse. If so, it's possible the GRIB designers decided not to do that, reasoning that once the desired precision had been achieved, no further cost in data size should be paid. Even if true, there's no reason that I can think of not to allow the user to specify the desired precision as an arbitrary floating point number, e.g. 1/3.
Meanwhile, I will explore the possibility of storing "limited precision" floating point data in netCDF-4, as a way to make file sizes small. This could make netCDF-4 an acceptable alternative to GRIB.
For further background, see: