Re: [thredds] THREDDS Data Server serving from Amazon S3

Hi all,

As you mention, mounting S3 as a file system was problematic for a few reasons 
(including the speed issues), which is why we wanted to look into other options.

I had come across the NOAA/OPeNDAP work on Hyrax, which was interesting and 
might still be a fall-back depending on how my investigations go. I hadn't seen 
that paper, though (thanks, James); it was a good summary of the work and 
findings.

There are other AWS considerations and techniques that I am still learning 
about. It might still be the case that we need to use EBS in conjunction with 
S3.

John's point about the file system operations needed to provide random access 
is an important one. If I understand it correctly, the Hyrax work kept a local 
cache of files as they were needed. Local caching, populated from S3 on demand, 
might be the technique we need to employ. It's a pain to manage resources like 
that (disk size, cache invalidation, etc.), though, so we'll have to see.
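
Roughly, what I have in mind is something along the lines of the sketch below 
(AWS SDK for Java; the class name, bucket, and cache layout are placeholders I 
made up, not working code from our setup):

    import java.io.File;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GetObjectRequest;

    /**
     * Sketch only: fetch an S3 object into a local cache directory the first
     * time it is needed, so the server can do ordinary random access on a
     * real local file.
     */
    public class S3FileCache {

        private final AmazonS3 s3 = new AmazonS3Client();  // credentials from the environment
        private final String bucket;
        private final File cacheDir;

        public S3FileCache(String bucket, File cacheDir) {
            this.bucket = bucket;
            this.cacheDir = cacheDir;
        }

        /** Return a local copy of the object, downloading it from S3 on a cache miss. */
        public File getLocalCopy(String key) {
            File local = new File(cacheDir, key.replace('/', File.separatorChar));
            if (!local.exists()) {
                local.getParentFile().mkdirs();
                // getObject(request, file) streams the object straight to disk
                s3.getObject(new GetObjectRequest(bucket, key), local);
            }
            return local;  // e.g. open with ucar.unidata.io.RandomAccessFile from here
        }
    }

The resource management I mentioned (eviction by disk size, invalidation, and 
so on) would have to sit around something like this.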

I made a bit of progress today fleshing out a CrawlableDataset implementation; 
I expect to come up against a few more challenges tomorrow.
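
For the listing side, the plan is to fake directories using key prefixes and a 
"/" delimiter, roughly like this (again just a sketch with made-up names; it is 
not wired into the CrawlableDataset interface yet):

    import java.util.ArrayList;
    import java.util.List;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ListObjectsRequest;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    /** Sketch of listing one "directory" level of a bucket by prefix. */
    public class S3Lister {

        private final AmazonS3 s3 = new AmazonS3Client();

        /** List the child "directories" (common prefixes) and objects under a prefix. */
        public List<String> listChildren(String bucket, String prefix) {
            List<String> children = new ArrayList<String>();
            ListObjectsRequest req = new ListObjectsRequest()
                    .withBucketName(bucket)
                    .withPrefix(prefix)
                    .withDelimiter("/");  // one level only, like a directory listing
            ObjectListing listing;
            do {
                listing = s3.listObjects(req);
                children.addAll(listing.getCommonPrefixes());      // sub-"directories"
                for (S3ObjectSummary obj : listing.getObjectSummaries()) {
                    children.add(obj.getKey());                    // individual objects
                }
                req.setMarker(listing.getNextMarker());            // page through large listings
            } while (listing.isTruncated());
            return children;
        }
    }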

Thanks all for the input!

Cheers,

David





________________________________________
From: Nathan Potter <gnatman.p@xxxxxxxxx>
Sent: Wednesday, 15 July 2015 8:17 AM
To: Jeff McWhirter
Cc: Nathan Potter; Robert Casey; John Caron; David Nahodil; 
thredds@xxxxxxxxxxxxxxxx
Subject: Re: [thredds] THREDDS Data Server serving from Amazon S3

Jeff,

I would also add that, because of the time costs associated with retrieving from 
Glacier, it becomes crucial to only get what you really want. To that end, I 
believe that such a system can only work well if the granule metadata (and 
probably any shared dimensions or Map vectors) is cached outside of Glacier, so 
that users can retrieve this information without having to incur the time and 
fiscal expense of a full Glacier retrieval.

In DAP parlance, the DDS/DAS/DDX/DMR responses should be immediately available 
for all holdings, and I think that Map vectors/dimensions should also be 
included in this. This would go a long way towards making such a system useful 
to a savvy client.
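
A cheap cache in front of Glacier is really all that part needs. The sketch 
below is purely illustrative (made-up bucket layout, not code from the 
prototype), but it shows the idea: the metadata responses live somewhere fast 
(here, S3), and only a miss implies a Glacier retrieval.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.AmazonS3Exception;
    import com.amazonaws.services.s3.model.S3Object;

    /** Sketch: serve cached DAP metadata responses from S3; only the data needs Glacier. */
    public class MetadataCache {

        private final AmazonS3 s3 = new AmazonS3Client();
        private final String metadataBucket;

        public MetadataCache(String metadataBucket) {
            this.metadataBucket = metadataBucket;
        }

        /**
         * Return the cached metadata document for a granule, or null if it is
         * not cached and the caller would have to fall back to Glacier.
         */
        public S3Object getCachedResponse(String granuleId, String responseType) {
            String key = "metadata/" + granuleId + "." + responseType;  // e.g. "dds", "das", "dmr"
            try {
                return s3.getObject(metadataBucket, key);
            } catch (AmazonS3Exception e) {
                if (e.getStatusCode() == 404) {
                    return null;  // not cached; a (slow, costly) Glacier retrieval would be needed
                }
                throw e;
            }
        }
    }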


N

> On Jul 14, 2015, at 2:52 PM, Nathan Potter <ndp@xxxxxxxxxxx> wrote:
>
>
>
> Jeff,
>
> I wrote some prototypes for Hyrax that utilized Glacier/S3/EBS to manage 
> data. It was a proof-of-concept effort - not production ready by any means. 
> It seems your idea was very much in line with my own. My thinking was to put 
> EVERYTHING in Glacier, spool things off Glacier into S3 as needed by the 
> server, and then copy them into an EBS volume for operational access. 
> Least-recently accessed content would get purged, first from EBS and then 
> later from S3, so that eventually the only copy would be in Glacier, at least 
> until the item was accessed again. I think it would be really interesting to 
> work up a real system that does this for DAP services of science data!
>
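> Stripped down, the lookup order was EBS, then S3, then a Glacier retrieval 
> job. In a sketch (illustrative names only, not the actual prototype code):
>
>     import java.io.File;
>
>     import com.amazonaws.services.glacier.AmazonGlacierClient;
>     import com.amazonaws.services.glacier.model.InitiateJobRequest;
>     import com.amazonaws.services.glacier.model.JobParameters;
>     import com.amazonaws.services.s3.AmazonS3;
>     import com.amazonaws.services.s3.AmazonS3Client;
>     import com.amazonaws.services.s3.model.AmazonS3Exception;
>     import com.amazonaws.services.s3.model.GetObjectRequest;
>
>     /** Sketch of the EBS -> S3 -> Glacier lookup order for a single granule. */
>     public class TieredStore {
>
>         private final AmazonS3 s3 = new AmazonS3Client();
>         private final AmazonGlacierClient glacier = new AmazonGlacierClient();
>         private final File ebsDir;    // fast local (EBS) copies
>         private final String bucket;  // S3 staging bucket
>         private final String vault;   // Glacier vault holding everything
>
>         public TieredStore(File ebsDir, String bucket, String vault) {
>             this.ebsDir = ebsDir;
>             this.bucket = bucket;
>             this.vault = vault;
>         }
>
>         /** Return a usable local file now, or null after kicking off a Glacier job. */
>         public File acquire(String key, String archiveId) {
>             File local = new File(ebsDir, key);
>             if (local.exists()) {
>                 return local;  // already spooled to EBS
>             }
>             try {
>                 local.getParentFile().mkdirs();
>                 s3.getObject(new GetObjectRequest(bucket, key), local);  // spool S3 -> EBS
>                 return local;
>             } catch (AmazonS3Exception e) {
>                 if (e.getStatusCode() != 404) {
>                     throw e;
>                 }
>                 // Not in S3 either: start an archive retrieval and come back hours later.
>                 String jobId = glacier.initiateJob(new InitiateJobRequest()
>                         .withVaultName(vault)
>                         .withJobParameters(new JobParameters()
>                                 .withType("archive-retrieval")
>                                 .withArchiveId(archiveId)))
>                         .getJobId();
>                 // jobId would be recorded so the Glacier -> S3 copy can be finished later.
>                 return null;
>             }
>         }
>     }
>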
> Our experience with S3 filesystems was similar to Roy’s: the various S3 
> filesystem solutions that we looked at didn’t really cut it (speed and 
> proprietary utilization of S3). But managing S3 isn’t that tough: I skipped 
> the filesystem approach and just used the AWS HTTP API for S3, and it was 
> quick and easy. Glacier is more difficult: access times are long for 
> everything. That includes 4 hours to get an inventory report, despite the 
> fact that the inventory is computed by AWS once every 24 hours. So managing 
> the Glacier holdings by keeping local copies of the inventory is important, 
> as is having a way to verify that the local inventory stays in sync with the 
> actual inventory.
>
>
> Nathan
>
>> On Jul 14, 2015, at 1:00 PM, Jeff McWhirter <jeff.mcwhirter@xxxxxxxxx> wrote:
>>
>>
>> Glacier could be used for storage of all that data that you need to keep 
>> around but rarely if ever access - e.g., level-0 instrument output, raw 
>> model output, etc. If your usage model supports this type of latency, then 
>> the cost savings (1/10th) are significant.
>>
>> This is where hiding the storage semantics behind a file system breaks down. 
>> The application can't be agnostic of the underlying storage, as it needs to 
>> support delays in staging data, communication with the end user, caching, etc.
>>
>> -Jeff
>>
>>
>>
>> On Tue, Jul 14, 2015 at 1:35 PM, Robert Casey <rob@xxxxxxxxxxxxxxxxxxx> 
>> wrote:
>>
>>      Hi Jeff-
>>
>>      Of note, Amazon Glacier is meant for infrequently needed data, so a 
>> call-up for data from that source will require something on the order of a 
>> 5-hour wait to retrieve to S3. I think they are developing a near-line 
>> storage solution, a bit more expensive, to compete with Google's new 
>> near-line storage, which provides retrieval times on the order of seconds.
>>
>>      -Rob
>>
>>> On Jul 14, 2015, at 10:10 AM, Jeff McWhirter <jeff.mcwhirter@xxxxxxxxx> 
>>> wrote:
>>>
>>> On this note -
>>> What I really want is a file system that can transparently manage data 
>>> between primary (SSD), secondary (S3), and tertiary (Amazon Glacier) 
>>> stores. Actively used data would migrate into primary storage, while old 
>>> archived data would move off into cheaper tertiary storage. I've thought of 
>>> implementing this at the application level in RAMADDA, but a 
>>> file-system-based approach would be much smarter.
>>>
>>> How do the archive folks on this list manage these kinds of storage 
>>> environments?
>>>
>>> -Jeff
>>>
>>>
>>>
>>>
>>> On Tue, Jul 14, 2015 at 10:44 AM, John Caron <caron@xxxxxxxx> wrote:
>>> Hi David:
>>>
>>> At the bottom of the TDM, we rely on RandomAccessFile. Do you know if S3 
>>> supports that abstraction (essentially POSIX file semantics, e.g. seek(), 
>>> read())? My guess is that S3 only allows complete file transfers (?)
>>>
>>> It would be worth investigating whether anyone has implemented a Java 
>>> FileSystemProvider for S3.
>>>
>>> I'll have a closer look when I get time.
>>>
>>> John
>>>
>>> On Mon, Jul 13, 2015 at 7:59 PM, David Nahodil <David.Nahodil@xxxxxxxxxxx> 
>>> wrote:
>>> Hi all,
>>>
>>>
>>> We are looking at moving our THREDDS Data Server to Amazon EC2 instances 
>>> with the data hosted on S3. I'm just wondering whether anyone has tried 
>>> using TDS with data hosted on S3.
>>>
>>>
>>> I had a quick back-and-forth with Sean at Unidata (see below) about this.
>>>
>>>
>>> Regards,
>>>
>>>
>>> David
>>>
>>>
>>>>> Unfortunately, I do not know of anyone who has done this, although we 
>>>>> have had at least one other person ask. From what I understand, there is 
>>>>> a way to mount S3 storage as a virtual file system, in which case I 
>>>>> would *think* that the TDS would work as it normally does (depending on 
>>>>> the kind of data you have).
>>>
>>>> We have considered mounting the S3 storage as a filesystem and running it 
>>>> like that. However, our feeling was that the tools were not really 
>>>> production ready and that we would really be misrepresenting S3 by 
>>>> pretending it is a file system. That is why we're investigating whether 
>>>> anyone has used TDS with the S3 API directly.
>>>
>>>>> What kind of data do you have? Will your TDS also be in the cloud? Do you 
>>>>> plan on serving the data inside of amazon to other EC2 instances, or do 
>>>>> you plan on crossing the cloud/commodity web boundary with the data, in 
>>>>> which case that could get very expensive quite quickly?
>>>
>>>> We have about 2 terabytes of marine and climate data that we are currently 
>>>> serving from our existing infrastructure. The plan is to move the 
>>>> infrastructure to Amazon Web Services, so TDS would be hosted on EC2 
>>>> machines and the data on S3. We're hoping this setup will work okay, but 
>>>> we might still have a hurdle or two ahead of us. :)
>>>
>>>> We have someone here who once wrote a plugin/adapter for TDS to work with 
>>>> an obscure filesystem that our data used to be stored on. So we have a 
>>>> little experience in what might be involved in doing the same with S3. We 
>>>> just wanted to make sure that, if anyone had already done some work, we 
>>>> made use of it.
>>>
>>>>> We very, very recently (as in a day ago) got some Amazon resources to 
>>>>> play around on, but we won't have a chance to kick those tires until 
>>>>> after our training workshops at the end of the month.
>>>
>>>
>
> = = =
> Nathan Potter                        ndp at opendap.org
> OPeNDAP, Inc.                        +1.541.231.3317
>

= = =
Nathan Potter                        ndp at opendap.org
OPeNDAP, Inc.                        +1.541.231.3317



