Re: [netcdf-java] [EXTERNAL] Re: Slow CDMS3 access via netcdf java

To: "Gangl, Michael E (US 398B)" <michael.e.gangl@xxxxxxxxxxxx>
Subject: Re: [netcdf-java] [EXTERNAL] Re: Slow CDMS3 access via netcdf java
From: John Caron <jcaron1129@xxxxxxxxx>
Date: Thu, 23 Sep 2021 15:18:33 -0600
Hi Michael:

1) I just wanted to know the timing of reading the header vs. reading the
variable's data.
2) netcdf-4 has the dataset metadata scattered around the file, so it's
making many calls to S3, I'm guessing. netcdf-3, in contrast, has all the
info at the beginning of the file.
3) My prejudice is that you need a server running locally on S3. There are
other possibilities, but they are more involved.
4) I think the medium-range solution that Unidata is considering is
running a gRPC server. gRPC uses HTTP/2 for higher performance. I think
they're considering backporting that to ver5 for use in thredds and
probably their python stack.
5) The comment about downloading the entire file was intended to be about
making a local cache, then deleting the file when done. Obviously it
depends on your usage pattern whether that's a reasonable thing to do.
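
The cache-then-delete idea in point 5 could look roughly like the sketch below. It uses only the Java standard library; the names `LocalCacheSketch`, `LocalFileTask`, and `withLocalCopy` are hypothetical and not part of netcdf-java, and the in-memory stream is just a stand-in for whatever stream an S3 client would return.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class LocalCacheSketch {

    /** Hypothetical callback: whatever reads the local copy (for example, a NetCDF reader). */
    interface LocalFileTask<T> {
        T apply(Path localFile) throws IOException;
    }

    /**
     * Copies the remote stream to a temporary file in one sequential pass,
     * runs the task against the local copy, then deletes the copy.
     */
    static <T> T withLocalCopy(InputStream remote, LocalFileTask<T> task) throws IOException {
        Path tmp = Files.createTempFile("remote-cache-", ".nc");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(remote, tmp, StandardCopyOption.REPLACE_EXISTING);
            return task.apply(tmp);
        } finally {
            // Delete the local copy whether or not the task succeeded.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // ASSUMPTION: stand-in for an S3 object stream; a real caller would pass the object's InputStream.
        InputStream fakeRemote = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
        long size = withLocalCopy(fakeRemote, Files::size);
        System.out.println("Local copy held " + size + " bytes");
    }
}
```

In real use the task would open the temporary path with the NetCDF reader instead of calling `Files::size`. Whether this is worthwhile depends on the usage pattern, as noted in point 5.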

regards,
John

On Thu, Sep 23, 2021 at 1:33 PM Gangl, Michael E (US 398B) <
michael.e.gangl@xxxxxxxxxxxx> wrote:

> These are netcdf4 files. This is simply a test to show that reading a
> single variable was unbearably slow; I have no real use for the output of
> it. I wrote the test because we ran into timeouts when trying to set up S3
> access from THREDDS, which also uses the netcdf-java library. Our entire
> archive (500TB) is in S3. We supply THREDDS access to the users to make
> accessing regional/timeseries data easier. We can’t download all of this
> locally and then delete it. Troubleshooting what was happening led me
> to these commands, which simply open a file and get variable data, as
> ‘the culprit’ as far as I’m concerned.
>
>
>
> By setting buffer and maxS3Cache size to -1 (turning them off) it seems we
> are now simply downloading the file.
>
>
>
> I don’t think that’s ideal in the long run, but was the only way to get
> the dataset to be read from netcdf-java, and thereby Thredds, in any sort
> of reasonable fashion.
>
>
>
> I guess my ask is to have THREDDS run faster on object store / s3 data,
> and I’m coming to what I think the source of the issues is (netcdf-java)
> and bypassing the thredds community.
>
>
>
> -Mike
>
>
>
> *From: *John Caron <jcaron1129@xxxxxxxxx>
> *Date: *Wednesday, September 22, 2021 at 8:49 PM
> *To: *Mike Gangl <michael.e.gangl@xxxxxxxxxxxx>
> *Cc: *"netcdf-java@xxxxxxxxxxxxxxxx" <netcdf-java@xxxxxxxxxxxxxxxx>
> *Subject: *[EXTERNAL] Re: [netcdf-java] Slow CDMS3 access via netcdf java
>
>
>
> Hi Michael: what kind of files are these? netcdf3 or netcdf4? What is the
> output of your example program?
>
>
>
> Perhaps you should download it locally and access it from there, and then
> delete when done?
>
>
>
> John
>
>
>
> On Tue, Sep 21, 2021 at 1:40 PM Gangl, Michael E (US 398B) via netcdf-java
> <netcdf-java@xxxxxxxxxxxxxxxx> wrote:
>
> I’m writing on behalf of podaac [0] which really wants to move its
> thredds server to the AWS cloud.
>
>
>
> Our setup is essentially an EC2 instance with a lot of network bandwidth
> to our S3 datastores. We hope to use Thredds to read from S3 directly.
> We’ve got this up and running and can get some results for very small
> requests, but we noticed any type of large or multifile query essentially
> takes too long to be effective.
>
>
>
> Digging down, I’ve constructed a test to essentially do a ‘read’ on a
> single variable from a large file (~720MB).
>
>
>
> try {
>     long startTime = System.currentTimeMillis();
>     NetcdfFile ncfile = NetcdfFiles.open(
>         "cdms3://ngap-cumulus-uat@aws/podaac-uat-cumulus-protected"
>         + "?MUR-JPL-L4-GLOB-v4.1/20210430090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc");
>     long stopTime = System.currentTimeMillis();
>     System.out.println("Read header took " + (stopTime - startTime) + " ms");
>
>     startTime = System.currentTimeMillis();
>     Variable v = ncfile.findVariable("analysed_sst");
>     System.out.println(v.read());
>     stopTime = System.currentTimeMillis();
>     System.out.println("Read variable took " + (stopTime - startTime) + " ms");
> } catch (Exception e) {
>     e.printStackTrace();
> }
>
>
>
> This… takes absolutely forever- still waiting on some tests to return but
> they’ve all taken > 20 minutes and I end up closing them trying to
> determine what’s going on. As a comparison, I’m able to read the entire
> 720MB file using the AWS cli in under a minute (around 25MiB/s over my
> wifi):
>
>
>
> time aws s3 cp s3://podaac-uat-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210430090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc . --profile ngap-service-uat
>
> download: s3://podaac-uat-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210430090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc to ./20210430090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
>
>
>
> real  0m45.867s
> user  0m2.932s
> sys   0m1.891s
>
>
>
> Is there any control (or insight) that I have over why this is taking so
> long? My only guess as to why it takes so long would be: it’s reading small
> pieces of the file serially or even in parallel, but the cost of each
> connect and download is so expensive. Is there any way I can instruct it to
> simply download/cache the entire file? Read much more data in a single
> request? That would seem faster at this rate. Alternatively, speeding up
> the read in any way would be a benefit.
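
The guess above, that many small requests each pay a fixed connection cost, can be made concrete with rough arithmetic. The sketch below is a back-of-the-envelope model, not a measurement of netcdf-java: the class and method names are hypothetical, and the 50 ms latency and the chunk sizes are assumed values chosen only for illustration. The file size and bandwidth are the figures quoted in this thread.

```java
final class RangedReadCostSketch {

    /** Number of ranged requests needed to cover totalBytes in chunks of chunkBytes. */
    static long requestCount(long totalBytes, long chunkBytes) {
        return (totalBytes + chunkBytes - 1) / chunkBytes; // ceiling division
    }

    /**
     * Rough serial estimate: every request pays a fixed latency, plus the
     * transfer time for all bytes at the given bandwidth.
     */
    static double estimatedSeconds(long totalBytes, long chunkBytes,
                                   double latencySeconds, double bytesPerSecond) {
        return requestCount(totalBytes, chunkBytes) * latencySeconds
                + totalBytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        long fileBytes = 720L * 1024 * 1024;   // file size from the thread (~720MB)
        double bandwidth = 25.0 * 1024 * 1024; // ~25MiB/s, as reported for the AWS CLI
        double latency = 0.05;                 // ASSUMPTION: 50 ms per request, illustrative only
        long[] chunkSizes = {8L * 1024, 1024L * 1024, fileBytes}; // ASSUMED chunk sizes
        for (long chunk : chunkSizes) {
            System.out.printf("chunk=%d bytes -> %d requests, ~%.1f s%n",
                    chunk, requestCount(fileBytes, chunk),
                    estimatedSeconds(fileBytes, chunk, latency, bandwidth));
        }
    }
}
```

Under these assumed values, the small-chunk case is dominated by per-request latency rather than bandwidth, which is consistent with a whole-file download finishing much sooner than a long series of small reads.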
>
>
>
> Thanks,
>
> Mike
>
>
>
> [0] https://podaac.jpl.nasa.gov/
>
>
>
>
>
> *From: *netcdf-java <netcdf-java-bounces@xxxxxxxxxxxxxxxx> on behalf of
> Sean Arms <sarms@xxxxxxxx>
> *Date: *Thursday, September 16, 2021 at 12:30 PM
> *To: *"netcdf-java@xxxxxxxxxxxxxxxx" <netcdf-java@xxxxxxxxxxxxxxxx>, "
> thredds@xxxxxxxxxxxxxxxx" <thredds@xxxxxxxxxxxxxxxx>
> *Subject: *[EXTERNAL] [netcdf-java] A farewell message
>
>
>
> Dear THREDDS and Netcdf-Java community,
>
> My last day at Unidata will be tomorrow, September 17th, 2021. It was not
> an easy decision, to say the least, but I believe this is the right choice
> for my family and me. It has been my pleasure to serve you over these past
> ten years.
>
> Unidata will continue to host and support the development of the THREDDS
> stack. Hailey Johnson will be taking over as project lead, and will be
> reaching out to you with some details for future plans for the netCDF-Java
> library and the TDS. The roadmap that Hailey is working on contains many
> exciting developments for the future, and I look forward to watching and
> helping, in a very limited way, these developments move forward as a
> community contributor.
>
> As always, projects like netCDF-java and the TDS rely upon community
> interactions and contributions to be sustainable. Contributions to the code
> base, documentation, tackling issues, and answering questions on the
> mailing lists are all ways that you can help keep these efforts moving into
> the future. Such efforts will be incredibly helpful during the next several
> months as Hailey continues spinning-up on the THREDDS efforts, and your
> continued support and patience will be greatly appreciated throughout this
> transition period.
>
> With gratitude and hope for the future,
>
> Sean
>
> _______________________________________________
> NOTE: All exchanges posted to Unidata maintained email lists are
> recorded in the Unidata inquiry tracking system and made publicly
> available through the web.  Users who post to any of the lists we
> maintain are reminded to remove any personal information that they
> do not want to be made public.
>
>
> netcdf-java mailing list
> netcdf-java@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe, visit:
> https://www.unidata.ucar.edu/mailing_lists/
>
>