Re: [netcdf-java] [EXTERNAL] Re: Slow CDMS3 access via netcdf java

  • To: "Gangl, Michael E (US 398B)" <michael.e.gangl@xxxxxxxxxxxx>
  • Subject: Re: [netcdf-java] [EXTERNAL] Re: Slow CDMS3 access via netcdf java
  • From: Joe Lee <hyoklee@xxxxxxxxxxxx>
  • Date: Tue, 28 Sep 2021 13:38:58 +0000
So the timeout issue is gone when NGAP is not used, correct?
In Cloud-nomics, you get the performance you pay for, and S3 is cheap storage.
You may want to try putting your file on EFS, mounting it on the EC2 instance 
that runs THREDDS, and comparing the performance.


From: netcdf-java <netcdf-java-bounces@xxxxxxxxxxxxxxxx> On Behalf Of Gangl, 
Michael E (US 398B) via netcdf-java
Sent: Tuesday, September 28, 2021 8:28 AM
To: John Caron <jcaron1129@xxxxxxxxx>; Hailey Johnson <hajohns@xxxxxxxx>
Cc: netcdf-java@xxxxxxxxxxxxxxxx
Subject: Re: [netcdf-java] [EXTERNAL] Re: Slow CDMS3 access via netcdf java

Thanks John,


  1.  Reading the header was OK: under 1 second outside of the cloud, probably 
close to 1 s. Turning off all caching/buffering made that jump to ~5 seconds 
outside of the cloud. Presumably it is faster inside, but not by a huge margin. 
I’d need to test it.
  2.  Agreed, but the time to access the header was dwarfed by the time needed 
to read large pieces of the variable metadata
  3.  Our THREDDS data server runs in aws us-west-2. We tried both NGAP and 
non-NGAP (NASA Earthdata Cloud) versions (see 
https://github.com/mgangl/tds-mur-test for some comparisons).
  4.  Hmm, my assumption/hope would be to read data from S3 (in parallel) for 
long time series and cache it locally (for a time) to optimize related/repeated 
requests for ‘hot data’. I’m not sure HTTP/2 would help much in that situation, 
but it’s something I can look into.
  5.  I’m more than happy to download the file into a cache for some time, so 
it’s not way off. But these long time series are a killer for some reason. 
Unfortunately, it wouldn’t be cost-effective to cache this much data.
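
The "read from S3 in parallel" idea in point 4 can be sketched as below. This is only an illustration under stated assumptions, not how netcdf-java or THREDDS actually reads from S3: `ParallelRangeReader` and `RangeSource` are hypothetical names, and `RangeSource.read` stands in for whatever ranged GET the storage client provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: split one large byte range into chunks and fetch the chunks concurrently.
final class ParallelRangeReader {

    /** Hypothetical hook for a ranged GET: returns `length` bytes starting at `offset`. */
    interface RangeSource {
        byte[] read(long offset, int length) throws Exception;
    }

    static byte[] readRange(RangeSource source, long offset, int length,
                            int chunkSize, int threads) throws Exception {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per chunk; the last chunk may be shorter.
            List<Future<byte[]>> parts = new ArrayList<>();
            for (int done = 0; done < length; done += chunkSize) {
                final long partOffset = offset + done;
                final int partLength = Math.min(chunkSize, length - done);
                parts.add(pool.submit(() -> source.read(partOffset, partLength)));
            }
            // Reassemble in request order; Future.get() blocks until each chunk arrives.
            byte[] result = new byte[length];
            int position = 0;
            for (Future<byte[]> part : parts) {
                byte[] chunk = part.get();
                System.arraycopy(chunk, 0, result, position, chunk.length);
                position += chunk.length;
            }
            return result;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Whether this helps in practice depends on how many requests the bucket and the network tolerate at once; the chunk size and thread count are tuning knobs, not recommendations.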

@Hailey Johnson, we filed a ticket. The above 
repository tries to use a stock TDS docker image and tries to replicate the 
same setup as https://github.com/lesserwhirls/tds-s3-jpl-test for the most 
part. Very confused why we are seeing such slow access to files that were used 
in the benchmarks.

-Mike

From: John Caron <jcaron1129@xxxxxxxxx>
Date: Thursday, September 23, 2021 at 2:26 PM
To: Mike Gangl <michael.e.gangl@xxxxxxxxxxxx>
Cc: netcdf-java@xxxxxxxxxxxxxxxx
Subject: Re: [EXTERNAL] Re: [netcdf-java] Slow CDMS3 access via netcdf java

Hi Michael:

1) I just wanted to know the timing of reading the header vs reading the 
variable's data.
2) netcdf-4 has the dataset metadata scattered around the file, so it's making 
many calls to S3, I'm guessing. netcdf-3, in contrast, has all the info at the 
beginning of the file.
3) My prejudice is that you need a server running locally on S3. There are 
other possibilities, but they are more involved.
4) I think the medium-range solution that Unidata is considering is running a 
gRPC server. gRPC uses HTTP/2 for higher performance. I think they're 
considering backporting that to ver5 for use in THREDDS and probably their 
Python stack.
5) The comment about downloading the entire file was intended to be about 
making a local cache, then deleting the file when done. Obviously it depends on 
your usage pattern whether that's a reasonable thing to do.
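
The "make a local copy, then delete it when done" pattern from point 5 could look roughly like the sketch below. The names `TempLocalCopy` and `ObjectSource` are hypothetical and not part of netcdf-java; `ObjectSource.open` is a placeholder for any client call that streams the whole remote object.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of "copy the object locally, read it, delete it when done".
final class TempLocalCopy implements AutoCloseable {

    /** Hypothetical hook: returns a stream over the whole remote object. */
    interface ObjectSource {
        InputStream open(String key) throws IOException;
    }

    private final Path localFile;

    TempLocalCopy(ObjectSource source, String key) throws IOException {
        localFile = Files.createTempFile("nc-cache-", ".nc");
        try (InputStream in = source.open(key)) {
            // One sequential transfer instead of many small ranged reads.
            Files.copy(in, localFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException | RuntimeException e) {
            Files.deleteIfExists(localFile); // do not leave a partial copy behind
            throw e;
        }
    }

    /** Local path to hand to a file-based reader. */
    Path path() {
        return localFile;
    }

    @Override
    public void close() throws IOException {
        Files.deleteIfExists(localFile);
    }
}
```

Used in a try-with-resources block, the copy exists only for the duration of the read; the path can be passed to a local open such as `NetcdfFiles.open(copy.path().toString())`.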

regards,
John

On Thu, Sep 23, 2021 at 1:33 PM Gangl, Michael E (US 398B) 
<michael.e.gangl@xxxxxxxxxxxx> wrote:
These are netcdf4 files. This is simply a test to show that reading a single 
variable was unbearably slow; I have no real use for the output of it. I wrote 
the test because we ran into timeouts when trying to set up S3 access from 
THREDDS, which also uses the netcdf-java library. Our entire archive (500TB) is 
in S3. We supply THREDDS access to the users to make accessing 
regional/timeseries data easier. We can’t download all of this locally and then 
delete it. Troubleshooting what was happening led me to these commands: simply 
opening the file and getting variable data is ‘the culprit’ as far as I’m 
concerned.

By setting buffer and maxS3Cache size to -1 (turning them off) it seems we are 
now simply downloading the file.

I don’t think that’s ideal in the long run, but was the only way to get the 
dataset to be read from netcdf-java, and thereby Thredds, in any sort of 
reasonable fashion.

I guess my ask is to have THREDDS run faster on object store / S3 data, and I’m 
coming directly to what I think is the source of the issues (netcdf-java), 
bypassing the THREDDS community.

-Mike

From: John Caron <jcaron1129@xxxxxxxxx>
Date: Wednesday, September 22, 2021 at 8:49 PM
To: Mike Gangl <michael.e.gangl@xxxxxxxxxxxx>
Cc: netcdf-java@xxxxxxxxxxxxxxxx
Subject: [EXTERNAL] Re: [netcdf-java] Slow CDMS3 access via netcdf java

Hi Michael: what kind of files are these? netcdf3 or netcdf4? What is the 
output of your example program?

Perhaps you should download it locally and access it from there, and then 
delete when done?

John

On Tue, Sep 21, 2021 at 1:40 PM Gangl, Michael E (US 398B) via netcdf-java 
<netcdf-java@xxxxxxxxxxxxxxxx> wrote:
I’m writing on behalf of podaac [0], which really wants to move its THREDDS 
server to the AWS cloud.

Our setup is essentially an EC2 instance with a lot of network bandwidth to 
our S3 datastores. We hope to use THREDDS to read from S3 directly. We’ve got 
this up and running and can get some results for very small requests, but we 
noticed that any type of large or multifile query takes too long to be 
effective.

Digging down, I’ve constructed a test to essentially do a ‘read’ on a single 
variable from a large file (~720MB).

    // Needs: ucar.nc2.NetcdfFile, ucar.nc2.NetcdfFiles, ucar.nc2.Variable
    String url = "cdms3://ngap-cumulus-uat@aws/podaac-uat-cumulus-protected?MUR-JPL-L4-GLOB-v4.1/20210430090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc";

    // Time opening the file (the header read) separately from the variable read.
    long startTime = System.currentTimeMillis();
    try (NetcdfFile ncfile = NetcdfFiles.open(url)) {
        long stopTime = System.currentTimeMillis();
        System.out.println("Read header took " + (stopTime - startTime) + " ms");

        startTime = System.currentTimeMillis();
        Variable v = ncfile.findVariable("analysed_sst");
        System.out.println(v.read()); // reads the entire variable
        stopTime = System.currentTimeMillis();
        System.out.println("Read variable took " + (stopTime - startTime) + " ms");
    } catch (Exception e) {
        e.printStackTrace();
    }

This… takes absolutely forever- still waiting on some tests to return but 
they’ve all taken > 20 minutes and I end up closing them trying to determine 
what’s going on. As a comparison, I’m able to read the entire 720MB file using 
the AWS cli in under a minute (around 25MiB/s over my wifi):


time aws s3 cp s3://podaac-uat-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210430090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc . --profile ngap-service-uat

download: s3://podaac-uat-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210430090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc to ./20210430090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc

real  0m45.867s
user  0m2.932s
sys   0m1.891s

Is there any control (or insight) that I have over why this is taking so long? 
My only guess as to why it takes so long would be that it’s reading small 
pieces of the file serially, or even in parallel, and the cost of each connect 
and download is very high. Is there any way I can instruct it to simply 
download/cache the entire file, or to read much more data in a single request? 
That would seem faster at this rate. Alternatively, speeding up the read in any 
way would be a benefit.
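
The guess above (many small requests, each paying a fixed cost) can be put into a back-of-envelope model: estimated time = number of requests × per-request latency + bytes / bandwidth. The sketch below is not a measurement and does not claim to describe netcdf-java's actual request pattern. The 50 ms latency and the request sizes are assumptions chosen for illustration; only the ~720 MB size and the 45.867 s copy time come from this message.

```java
// Back-of-envelope model, not a measurement: total time for a read that is
// split into many ranged requests, each paying a fixed round-trip latency.
final class RequestCostModel {

    /** Estimated seconds = requests * latency + bytes / bandwidth. */
    static double estimateSeconds(long totalBytes, long bytesPerRequest,
                                  double latencySeconds, double bytesPerSecond) {
        long requests = (totalBytes + bytesPerRequest - 1) / bytesPerRequest; // ceiling division
        return requests * latencySeconds + totalBytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        long fileBytes = 720L * 1000 * 1000;   // ~720 MB file from the message
        double bandwidth = fileBytes / 45.867; // effective rate implied by the `aws s3 cp` timing
        double latency = 0.05;                 // ASSUMED 50 ms per request, for illustration only

        // ASSUMED request sizes: 8 KiB, 512 KiB, 32 MiB, and the whole file at once.
        long[] requestSizes = {8L * 1024, 512L * 1024, 32L * 1024 * 1024, fileBytes};
        for (long size : requestSizes) {
            System.out.printf("request size %,d bytes -> ~%.0f s%n", size,
                    estimateSeconds(fileBytes, size, latency, bandwidth));
        }
    }
}
```

Under this model the latency term dominates once requests are small, which is one way a serial read could take far longer than a single bulk copy of the same file.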

Thanks,
Mike

[0] https://podaac.jpl.nasa.gov/


From: netcdf-java <netcdf-java-bounces@xxxxxxxxxxxxxxxx> on behalf of Sean Arms 
<sarms@xxxxxxxx>
Date: Thursday, September 16, 2021 at 12:30 PM
To: netcdf-java@xxxxxxxxxxxxxxxx, thredds@xxxxxxxxxxxxxxxx
Subject: [EXTERNAL] [netcdf-java] A farewell message

Dear THREDDS and Netcdf-Java community,

My last day at Unidata will be tomorrow, September 17th, 2021. It was not an 
easy decision, to say the least, but I believe this is the right choice for my 
family and me. It has been my pleasure to serve you over these past ten years.

Unidata will continue to host and support the development of the THREDDS stack. 
Hailey Johnson will be taking over as project lead, and will be reaching out to 
you with some details for future plans for the netCDF-Java library and the TDS. 
The roadmap that Hailey is working on contains many exciting developments for 
the future, and I look forward to watching and helping, in a very limited way, 
these developments move forward as a community contributor.

As always, projects like netCDF-java and the TDS rely upon community 
interactions and contributions to be sustainable. Contributions to the code 
base, documentation, tackling issues, and answering questions on the mailing 
lists are all ways that you can help keep these efforts moving into the future. 
Such efforts will be incredibly helpful during the next several months as 
Hailey continues spinning-up on the THREDDS efforts, and your continued support 
and patience will be greatly appreciated throughout this transition period.

With gratitude and hope for the future,

Sean
_______________________________________________
NOTE: All exchanges posted to Unidata maintained email lists are
recorded in the Unidata inquiry tracking system and made publicly
available through the web.  Users who post to any of the lists we
maintain are reminded to remove any personal information that they
do not want to be made public.


netcdf-java mailing list
netcdf-java@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit: 
https://www.unidata.ucar.edu/mailing_lists/