
[THREDDS #ZIZ-818863]: Thredds inflation of data



Ok, now I see.
As a rule, HDF5/netCDF-4 decompression operates on one
chunk at a time (where the chunk size is given by the chunking
parameters stored in the file). Do you know what the chunking parameters
are for one of your files? You can see them by running the ncdump command
from the netCDF C library:
     ncdump -hs <filename>
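
For a netCDF-4 file, the -s flag makes ncdump print the per-variable
"special" attributes. The relevant lines will look something like the
following (the variable shape and values here are only illustrative,
not taken from your file):

     float air_temperature(time, pressure, latitude, longitude) ;
             air_temperature:_Storage = "chunked" ;
             air_temperature:_ChunkSizes = 1, 1, 1152, 1536 ;
             air_temperature:_DeflateLevel = 2 ;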

My speculation is this:
1. You are using the pure-Java HDF5 reader code in Thredds
   (this is the default for netcdf-4 files).
2. The pure-Java HDF5 reader is either using a different
   implementation of zip or is breaking the incoming chunk
   into smaller pieces and decompressing those pieces separately.
I will undertake to see which case in #2 (if either) is being
used.
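
To make the difference concrete, here is a minimal sketch using plain
java.util.zip (this is not actual THREDDS code; the sizes are taken from
your description and from the avail_in=512 entries in your log):

    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    public class ChunkInflateSketch {

        // Case A: the whole compressed chunk is handed to the Inflater in one
        // setInput() call, so the accelerator sees one large inflate request.
        static byte[] inflateWholeChunk(byte[] compressed, int uncompressedSize)
                throws DataFormatException {
            Inflater inf = new Inflater();
            inf.setInput(compressed);                    // e.g. ~1.6MB at once
            byte[] out = new byte[uncompressedSize];     // e.g. ~7MB
            int written = 0;
            while (!inf.finished() && written < out.length) {
                written += inf.inflate(out, written, out.length - written);
            }
            inf.end();
            return out;
        }

        // Case B: the same chunk is fed to the Inflater in small slices, so
        // every call stays far below a 16384-byte hardware threshold.
        static byte[] inflateInSlices(byte[] compressed, int uncompressedSize,
                                      int slice) throws DataFormatException {
            Inflater inf = new Inflater();
            byte[] out = new byte[uncompressedSize];
            int written = 0;
            for (int off = 0; off < compressed.length; off += slice) {
                int len = Math.min(slice, compressed.length - off);
                inf.setInput(compressed, off, len);      // avail_in = 512 each time
                while (!inf.needsInput() && written < out.length) {
                    written += inf.inflate(out, written, out.length - written);
                }
            }
            inf.end();
            return out;
        }
    }

If the Java reader is doing something like case B, that would explain the
512-byte inflate calls you are seeing.
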
Any additional insight you have would be appreciated.


> Our dataflow is:
> 
> 1 - HPC Produces Chunked Compressed NetCDF Data
> 2 - HPC FTPs data to our Thredds Systems
> 3 - Downstream client systems request the data from Thredds via the opendap 
> interface.  They expect to retrieve uncompressed data, and request a single 
> chunk at a time (a rough client-side sketch follows this list).
> 4 - Thredds reads the chunk, uncompresses it, and then sends it using the 
> opendap protocol.
> 
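> (For illustration only: a minimal netCDF-Java sketch of the kind of 
> single-chunk read our clients make in step 3.  The class and exact call are 
> assumptions for the example; our real client systems may use other OPeNDAP 
> libraries.)
> 
>     import ucar.ma2.Array;
>     import ucar.nc2.Variable;
>     import ucar.nc2.dataset.NetcdfDataset;
> 
>     public class OneChunkRead {
>         public static void main(String[] args) throws Exception {
>             // OPeNDAP endpoint on our Thredds server (step 3 of the dataflow).
>             String url = "http://dvtds02-zvopaph2:8080/thredds/dodsC/decoupler/mhtest/original.nc";
>             try (NetcdfDataset ds = NetcdfDataset.openDataset(url)) {
>                 Variable v = ds.findVariable("air_temperature");
>                 // One chunk's worth of data: [0][43][0:1151][0:1535], ~7MB uncompressed.
>                 Array data = v.read("0,43,0:1151,0:1535");
>                 System.out.println("read " + data.getSize() + " values");
>             }
>         }
>     }
> 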
> It is step 4 where I want the hardware to be used.  I can force it to be 
> used by reducing the limit on the smallest decompress that is passed to the 
> hardware, however because Thredds is currently passing only 512 bytes at a 
> time, rather than the hardware's recommended minimum of 16384 bytes, 
> performance is awful.
> 
> As our client systems only ever request a full chunk at a time (which is 
> always ~7MB of data when uncompressed), the behaviour I was looking for is 
> that Thredds will read a single chunk from disk (between 1MB and 2MB 
> depending on the data and compression level), pass that whole chunk in one 
> call to java.util.zip (or at least more than 16384 bytes at a time; the 
> bigger the better), where the hardware will take over and inflate the data.  
> The hardware and java.util.zip then return the uncompressed data to 
> Thredds/Opendap, which then returns it to the client system.
> 
> Thanks
> 
> Martyn
> 
> Martyn Hunt  
> Technical Lead, Mainframe
> Met Office  FitzRoy Road  Exeter  Devon  EX1 3PB  United Kingdom
> Tel: +44 (0)1392 884897  
> Email: address@hidden  Website: www.metoffice.gov.uk
> 
> -----Original Message-----
> From: Unidata THREDDS Support [mailto:address@hidden]
> Sent: 24 October 2017 20:22
> To: Hunt, Martyn <address@hidden>
> Cc: address@hidden
> Subject: [THREDDS #ZIZ-818863]: Thredds inflation of data
> 
> My mistake; I thought you were talking about HTTP-level chunking and 
> compression because of the logs you sent me.
> But I am confused about where you want to use your hardware.  Is your plan to 
> use it to decompress the file on the server before transmitting it using the 
> opendap protocol? As a reminder, the file is decompressed before it is 
> translated into the opendap format in order to pass it over the http 
> connection.
> Can you elaborate on how you ideally want that special hardware to be used?
> 
> 
> 
> > That’s not what I understood from the HDF5 doc 
> > (https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/):
> >
> > "Dataset chunking also enables the use of I/O filters, including 
> > compression. The filters are applied to each chunk individually, and the 
> > entire chunk is processed at once."
> >
> > Note that I am only talking about reading and serving a compressed netcdf 
> > file here, and not about the apache/tomcat compression for data transfer, 
> > i.e. the problem I have is likely somewhere in here:
> >
> > https://github.com/Unidata/thredds/blob/b731bcb45b6e10b7e6102e97a9ef35e9fef43c93/cdm/src/main/java/ucar/nc2/iosp/hdf5/H5tiledLayoutBB.java
> >
> >
> > Martyn Hunt
> > Technical Lead, Mainframe
> > Met Office  FitzRoy Road  Exeter  Devon  EX1 3PB  United Kingdom
> > Tel: +44 (0)1392 884897
> > Email: address@hidden  Website: www.metoffice.gov.uk
> >
> > -----Original Message-----
> > From: Unidata THREDDS Support
> > [mailto:address@hidden]
> > Sent: 23 October 2017 18:34
> > To: Hunt, Martyn <address@hidden>
> > Cc: address@hidden
> > Subject: [THREDDS #ZIZ-818863]: Thredds inflation of data
> >
> > We use the Apache Httpclient system
> > (http://hc.apache.org/httpcomponents-client-4.5.x/)
> > so the fix will need to be with respect to that.
> >
> > My speculation is that there are two related issues that need investigation.
> > 1. chunking - 1 large response is chunked into multiple smaller chunks
> > on the server side that are then reassembled on the client side.
> > 2. A specific compressor -- GzipCompressingEntity, I think -- is used
> > to do the actual compression on the server side.
> >
> > I do not know the order in which these are used by the server side. It is 
> > possible that the compressor operates first and then the chunker divides 
> > that compressed output.
> > It is also possible that the chunker is first and that the compressor 
> > operates on each separate chunk.
> >
> > We will need to investigate to see which is the case (if either) and then 
> > figure out how to change the chunking and/or the compression parameters.  I 
> > suspect that sending very large (1.6MB) chunks is a bad idea. So, I would 
> > hope we set things up so that the compression is first and the compressed 
> > data is then chunked.
> > Note that this will also require a corresponding change on the client side.
> >
> > In any case, this is going to take a while for me to figure it out.
> >
> >
> > =======================
> > > I am currently trying to get a compression accelerator to work with 
> > > Thredds, with the aim of reducing the CPU time that Thredds spends 
> > > decompressing chunks of data.  The compression card plugs straight in to 
> > > IBM Java8 and java.util.zip with no changes needed to the application 
> > > that uses it.  However, the replacement code will always revert to 
> > > software inflation when the size of the data passed to java.util.zip is 
> > > less than 16384 bytes.
> > >
> > > Our data is chunked and compressed, with the data we want to retrieve 
> > > ending up as 70  ~1.6MB chunks (compressed), which should inflate to ~7MB 
> > > each (see the hdls.txt file for more detail).
> > >
> > > When requesting the data, I use the following URL to request a single 
> > > chunk (our applications run through the 70 chunks sequentially, one at a 
> > > time; in this example I'm just picking one chunk):
> > >
> > > http://dvtds02-zvopaph2:8080/thredds/dodsC/decoupler/mhtest/original.nc.ascii?air_temperature[0][43][0:1:1151][0:1:1535]
> > >
> > > While there may be a few smaller inflate operations at the start/end of 
> > > the request, I'd expect a single 1.6MB --> 7MB inflate request in there.  
> > > Instead, in the compression software logs, I see thousands of 512-byte 
> > > inflate requests, which, because they are smaller than the card's minimum 
> > > of 16384 bytes, never get passed to the compression card.
> > >
> > > e.g.
> > >
> > > 2017-10-23T12:55:31.643894+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0] inflate:   flush=1 next_in=0x9b917b198 avail_in=512 
> > > next_out=0x9b917b3b8 avail_out=9200 total_in=2048 total_out=31827 
> > > crc/adler=1b38b342
> > > 2017-10-23T12:55:31.644229+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0]            flush=1 next_in=0x9b917b398 avail_in=0 
> > > next_out=0x9b917c422 avail_out=4998 total_in=2560 total_out=36029 
> > > crc/adler=e835c747 rc=0
> > > 2017-10-23T12:55:31.644541+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0] inflate:   flush=1 next_in=0x9b917b398 avail_in=0 
> > > next_out=0x9b917b3b8 avail_out=9200 total_in=2560 total_out=36029 
> > > crc/adler=e835c747
> > > 2017-10-23T12:55:31.644909+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0]            flush=1 next_in=0x9b917b398 avail_in=0 
> > > next_out=0x9b917b3b8 avail_out=9200 total_in=2560 total_out=36029 
> > > crc/adler=e835c747 rc=-5
> > > 2017-10-23T12:55:31.645234+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0] inflate:   flush=1 next_in=0x9b917b198 avail_in=512 
> > > next_out=0x9b917b3b8 avail_out=9200 total_in=2560 total_out=36029 
> > > crc/adler=e835c747
> > > 2017-10-23T12:55:31.645568+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0]            flush=1 next_in=0x9b917b398 avail_in=0 
> > > next_out=0x9b917c47a avail_out=4910 total_in=3072 total_out=40319 
> > > crc/adler=f1a70cdc rc=0
> > > 2017-10-23T12:55:31.645879+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0] inflate:   flush=1 next_in=0x9b917b398 avail_in=0 
> > > next_out=0x9b917b3b8 avail_out=9200 total_in=3072 total_out=40319 
> > > crc/adler=f1a70cdc
> > > 2017-10-23T12:55:31.646199+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0]            flush=1 next_in=0x9b917b398 avail_in=0 
> > > next_out=0x9b917b3b8 avail_out=9200 total_in=3072 total_out=40319 
> > > crc/adler=f1a70cdc rc=-5
> > > 2017-10-23T12:55:31.646511+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0] inflate:   flush=1 next_in=0x9b917b198 avail_in=512 
> > > next_out=0x9b917b3b8 avail_out=9200 total_in=3072 total_out=40319 
> > > crc/adler=f1a70cdc
> > > 2017-10-23T12:55:31.646847+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0]            flush=1 next_in=0x9b917b398 avail_in=0 
> > > next_out=0x9b917c272 avail_out=5430 total_in=3584 total_out=44089 
> > > crc/adler=8dba79f4 rc=0
> > > 2017-10-23T12:55:31.647166+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0] inflate:   flush=1 next_in=0x9b917b398 avail_in=0 
> > > next_out=0x9b917b3b8 avail_out=9200 total_in=3584 total_out=44089 
> > > crc/adler=8dba79f4
> > > 2017-10-23T12:55:31.647490+00:00 dvtds02-zvopaph2 server: ### 
> > > [0x3ff183456a0]            flush=1 next_in=0x9b917b398 avail_in=0 
> > > next_out=0x9b917b3b8 avail_out=9200 total_in=3584 total_out=44089 
> > > crc/adler=8dba79f4 rc=-5
> > >
> > > Happy to send across the datafile I'm using as an example; please let me 
> > > know if you need any other info.
> > >
> > > Thanks
> > >
> > > Martyn
> > >
> > > Martyn Hunt
> > > Technical Lead, Mainframe
> > > Met Office  FitzRoy Road  Exeter  Devon  EX1 3PB  United Kingdom
> > > Tel: +44 (0)1392 884897
> > > Email: address@hidden  Website: www.metoffice.gov.uk
> > >
> > >
> > >
> >
> > =Dennis Heimbigner
> > Unidata
> >
> >
> >
> >
> >
> 
> =Dennis Heimbigner
> Unidata
> 
> 
> 
> 
> 

=Dennis Heimbigner
  Unidata


Ticket Details
===================
Ticket ID: ZIZ-818863
Department: Support THREDDS
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.