Unidata Developer's Bloghttps://www.unidata.ucar.edu/blogs/developer/en/feed/entries/atom2024-03-05T10:00:34-07:00Apache Rollerhttps://www.unidata.ucar.edu/blogs/developer/entry/my-summer-with-java-implementingMy summer with Java: Implementing dataset enhancements on THREDDS Data ServerUnidata News2023-08-08T15:46:22-06:002023-08-09T09:00:18-06:00
<div class="img_l" style="width: 150px;">
<a class="lightbox" title="Jessica Souza" href="/blog_content/images/2023/20230614_jsouza.png">
<img width="150" src="/blog_content/images/2023/20230614_jsouza.png" alt="Jessica Souza" />
</a>
<div class="caption">
Jessica Souza
</div>
</div>
<p class="byline">
by
<a href="/community/internship/#2023js">Jessica Souza</a>
<br />2023 Unidata summer intern
</p>
<p>
During my internship, I worked with the Unidata THREDDS team. My intentions this
summer were to learn Java, improve my coding skills, and have experience using it in
real world applications. I began my journey by converting existing unit tests for the
netCDF-Java library, which is tightly linked to the
<a href="https://www.unidata.ucar.edu/software/tds/">THREDDS Data Server</a> (TDS) code,
to the JUnit Java testing framework. Once I got this practice with Java and had a
working development environment, I was able to start working on my summer project.
</p>
<p>
With the extensive increase in the use of machine learning models in Earth science
related research, my project was an initiative in the direction of providing new
datasets intended for machine learning use. Since Earth sciences has become
substantially data-driven, with a variety of forecast models, large model
simulations, and satellite missions, there is an unprecedented rise in raw
unprocessed data. When working with machine learning models, significant
preprocessing of the data is required; this involves cleaning, re-scaling, and
splitting the dataset. The goal of re-scaling is to transform features onto a
similar range, improving the performance and training stability of the model.
Re-scaling is not always necessary, but it is essential when dealing with
multiple variables on different scales. My project focused on performing dataset
preprocessing, in this case re-scaling, before the data is accessed by users
targeting machine learning applications.
After reviewing 13 papers from the AMS journal <em>Artificial Intelligence for the
Earth Systems</em> (AIES), my Unidata mentor and I selected <em>standardization</em>
and <em>normalization</em> (common types of re-scaling) for implementation as part of
my project.
</p>
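For reference, the two re-scalings have simple closed forms: standardization maps each value x to (x − mean)/stddev, and min-max normalization maps x to (x − min)/(max − min). A minimal sketch in Java (the class and method names here are illustrative, not the actual netCDF-Java classes):

```java
// Illustrative sketch of the two re-scalings; not the netCDF-Java API.
public class Rescale {
    // Standardization (z-score): (x - mean) / stddev, using the
    // population standard deviation, as scikit-learn's StandardScaler does.
    public static double[] standardize(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / x.length);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / std;
        return out;
    }

    // Min-max normalization: (x - min) / (max - min), mapping onto [0, 1],
    // as scikit-learn's MinMaxScaler does by default.
    public static double[] normalize(double[] x) {
        double min = x[0], max = x[0];
        for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - min) / (max - min);
        return out;
    }
}
```

In the actual implementation, the statistics were computed with the Apache Commons Mathematics Library, which provides streaming variants suitable for large datasets.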
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="Visualization options for preprocessed data served by the TDS - Jupyter notebook and Godiva3 examples." href="/blog_content/images/2023/20230807_jessica_VisualizationNotebooks_Thredds.png">
<img width="200" src="/blog_content/images/2023/20230807_jessica_VisualizationNotebooks_Thredds.png" alt="Data visualizations" />
</a>
<div class="caption">
Preprocessed data on TDS (click to enlarge)
</div>
</div>
<p>
I decided to implement two functions in Java based on the <code>StandardScaler</code>
and <code>MinMaxScaler</code> functions from <a href="https://scikit-learn.org/">Scikit-learn</a>, a Python machine
learning library. Using an external Java library suitable for large data streams
(<a href="https://commons.apache.org/proper/commons-math/">Apache Commons Mathematics Library</a>), I created the classes <code>Standardizer</code>
and <code>Normalizer</code>. Next, they were integrated into the netCDF-Java codebase.
This process included creating constants/attributes in the <a href="https://docs.unidata.ucar.edu/netcdf-java/current/userguide/common_data_model_overview.html">Common Data Model</a> class
for standardization and normalization, adding <code>Standardizer</code> and
<code>Normalizer</code> to the set of possible data enhancements, and applying the
enhancements when "standardizer" or "normalizer" appeared as a netCDF variable attribute
and the data was of floating-point type. Using the new classes in the TDS through the <a href="https://docs.unidata.ucar.edu/netcdf-java/current/userguide/basic_ncml_tutorial.html">
NetCDF Markup Language</a> (NcML) allowed the creation of a virtual dataset that
could be returned to
the user without altering the original data or requiring additional disk usage. By
making these processed datasets available to TDS users, we reduce the amount of data
preprocessing required on the user end.
</p>
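As a sketch of how this wiring might look in NcML (the attribute name and placement below are assumptions based on the description above, not documented NcML syntax):

```xml
<!-- Hypothetical NcML: wrap an existing file and mark a float variable
     for standardization; the TDS would serve the re-scaled values as a
     virtual dataset, leaving the original file untouched. -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="gfs_example.nc">
  <variable name="Temperature_surface">
    <attribute name="standardizer" value="true" />
  </variable>
</netcdf>
```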
<p>
The initial datasets chosen for preprocessing on the THREDDS test server were forecast
(GFS) and satellite (GOES 18) data, due to their frequent use in the <em>AIES</em>
papers reviewed. In addition to adding a mechanism to access the preprocessed datasets
on the TDS test server, we included Jupyter notebooks for visualization of the
preprocessed variables. I also created automated tests to verify that the code
behaved as expected, which involved both unit testing and integration testing.
</p>
<p>
During the project, I also gained experience with GitHub in creating issues, pull
requests and code review. Furthermore, tests on the performance difference with the
use of the re-scaling were also evaluated. As next steps, the already reasonable
results of the performance tests could be improved and more datasets relevant to the
users could be provided.
</p>
<p>
This summer offered me invaluable personal and professional development
opportunities including the Unidata Users Workshop, Project Pythia Hackathon, and
the professional development workshop series for the UCAR interns. The combination
of all the experiences throughout the internship helped me build
confidence as an open source contributor. Working on my project, with the
dedicated support of my mentor and the THREDDS team, has deepened my passion for
scientific software development.
</p>
https://www.unidata.ucar.edu/blogs/developer/entry/hacktoberfest-challenge-share-your-jupyterHacktoberfest challenge - share your Jupyter NotebooksHailey Johnson2020-10-07T15:17:24-06:002020-10-07T15:17:24-06:00<p>Unidata is looking for community contributions to the <a href="https://docs.unidata.ucar.edu/tds/5.0/userguide/customizing_tds_look_and_feel.html#jupyter-notebooks">Jupyter Notebook service</a>, which provides Jupyter Notebooks to facilitate accessing, exploring, and visualizing datasets in the TDS. Users who have written Notebooks that interface with TDS datasets (or Python scripts that can be converted) or are interested in creating one can contribute in one of two ways (or both!):</p>
<ol>
<li>Contribute a generic viewer that would be useful in a typical THREDDS Data Server (see the GitHub issue <a href="https://github.com/Unidata/tds/issues/117">here</a>).</li>
<li>Contribute a viewer for a specific dataset, data type, or catalog included in the Unidata THREDDS Data Server (see GitHub issue <a href="https://github.com/Unidata/TdsConfig/issues/93">here</a>).</li>
</ol>
<p>The issues above are eligible for <a href="https://hacktoberfest.digitalocean.com/">Hacktoberfest</a>: you get Hacktoberfest credit and help the Unidata community! For more information on contributing and some helpful links, visit the GitHub issues above.</p>
<p>Happy hacking!</p>
https://www.unidata.ucar.edu/blogs/developer/entry/my-summer-of-improving-theMy Summer of Improving the TDS Web InterfaceUnidata News2018-08-10T14:35:07-06:002018-08-10T14:35:07-06:00
<p><link rel="stylesheet" type="text/css" href="/css/jquery/jquery.lightbox-0.5.css" media="screen" /></p>
<script type="text/javascript" src="/js/jquery/jquery.lightbox-0.5.min.js"></script>
<script type="text/javascript">
$(document).ready(function() {
$('a.lightbox').lightBox();
});
</script>
<!-- End Lightbox stuff -->
<div class="img_l" style="width: 150px;">
<a class="lightbox" title="Hailey Johnson" href="/blog_content/images/2018/20180608_hajohns_1_400.jpg">
<img width="150" src="/blog_content/images/2018/20180608_hajohns_1_400.jpg" alt="Hailey Johnson"
/> </a>
<div class="caption">
Hailey Johnson
</div>
</div>
<p class="byline">
by
<a href="https://www.unidata.ucar.edu/blogs/news/entry/welcome-summer-intern-hailey-johnson">Hailey Johnson</a>
<br />2018 Unidata summer intern
</p>
<p>During my time here at Unidata, I’ve focused on extending the THREDDS Data Server (TDS)
web interface and services. I spent the first few weeks of the summer redesigning the
interface to be more intuitive to end users and implementing UI changes using Thymeleaf
HTML templating. The new TDS catalog pages are designed with a “plug-and-play” structure,
allowing users to override or insert their own contributed HTML, which is processed
by a server-side Thymeleaf template resolver.
</p>
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="THREDDS Catalog data pages now include a link to a pre-populated Jupyter notebook demonstrating data access in Python."
href="/blog_content/images/2018/20180801_devblog_intern_hailey_01.png">
<img width="200" src="/blog_content/images/2018/20180801_devblog_intern_hailey_01.png" alt="TDS Catalog"
/> </a>
<div class="caption"> (click to enlarge) </div>
</div>
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="Notebooks demonstrate how to use Siphon to retrieve a dataset."
href="/blog_content/images/2018/20180801_devblog_intern_hailey_02.png">
<img width="200" src="/blog_content/images/2018/20180801_devblog_intern_hailey_02.png" alt="Siphon notebook" />
</a>
<div class="caption"> (click to enlarge) </div>
</div>
<p>Halfway through the summer, I thought that it would be useful if every dataset page in
the TDS generated a code snippet demonstrating how to access that dataset and its metadata
using Siphon. This idea rapidly escalated into a TDS Jupyter Notebook service, which
returns a Jupyter Notebook (ipynb) pre-populated with a dataset ID and catalog URL.
The default Notebook describes and demonstrates use of the
<a href="https://www.unidata.ucar.edu/software/siphon/">Siphon</a> remote access protocol and uses notebook widgets to allow users to explore
variables within the dataset. Users can supply other Notebook files in place of (or
in addition to) the default Notebook. Contributed Notebooks can be registered as valid
for all datasets or a subset using the ipynb metadata; the mapping between Notebooks
and datasets can be specified by dataset IDs, parent catalogs, or data type (e.g. grid,
point). (Administrators of TDS version 5.0 and later will find information on tailoring
the notebooks in the section of the documentation titled "Extending TDS Services.")
</p>
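For illustration, a contributed Notebook's registration might look like the following ipynb metadata sketch (the field names are assumptions for this example; the actual schema is described in the "Extending TDS Services" documentation):

```json
{
  "metadata": {
    "viewer_info": {
      "accept_datasetIDs": ["casestudies/example_dataset"],
      "accept_catalogs": ["grib/NCEP/GFS"],
      "accept_dataset_types": ["grid"]
    }
  }
}
```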
<p>You can find my contributions to the TDS live on Unidata's
<a href="https://thredds-dev.unidata.ucar.edu/thredds/catalog/catalog.html">thredds-dev data server</a>!
</p>
<p>The goal of these new features is to help lower the barrier to entry and enable a broader
community of users to access data using the TDS. I’ve enjoyed the work I’ve done at
Unidata this summer and would like to continue working on problems of data management
and access in the geosciences in my future research and career.
</p>
https://www.unidata.ucar.edu/blogs/developer/entry/thredds-licence-changeTHREDDS License ChangeSean Arms2018-02-12T12:22:47-07:002018-02-12T13:33:27-07:00
<div class="img_l" style="width: 100px;">
<img width="100" src="/images/logos/thredds_netcdf-150x150.png" alt="TDS" />
</div>
<p>As we approach the first public beta of version 5.0 of the THREDDS Data Server (TDS),
we have decided to revisit our software license. Currently, both NetCDF-Java and the
TDS are released under the same license that the netCDF C library uses, which is a
license that was "home grown" at UCAR. It's usually called an "MIT-style license,"
though it is perhaps more similar to the BSD-3 Clause license. Rather than continue
to use the "home grown" license, we will be moving to a standard, off-the-shelf BSD-3
license, bringing the TDS and NetCDF-Java packages more in line with standard practice
within the Open Source community.
</p>
<h3>What This Means For You</h3>
<p>If you are a user of the TDS, this change in licensing does not affect you.</p>
<p>If you are a developer who includes technology from the TDS or netCDF-Java in your own work, you will
find that the new license is slightly less restrictive than the current license, in
that it does not explicitly request that you credit UCAR/Unidata in publications that
result from your use of the technology. Of course, we still appreciate any acknowledgement
you do provide; such credits are very useful to Unidata in making the case to our funders
that the work we do is valuable to our community. If you find Unidata’s contributions
to community cyberinfrastructure useful, please consider
<a href="https://www.unidata.ucar.edu/community/index.html#acknowledge">citing the technology</a> you’re
using, or perhaps including a
<a href="https://www.unidata.ucar.edu/images/logos/badges/badges.html#home">badge</a>
in your online materials.
</p>
<h3>What is Changing</h3>
<p>Here is the text of the current license: </p>
<pre>
Copyright 1998-2015 University Corporation for Atmospheric Research/Unidata
Portions of this software were developed by the Unidata Program at the
University Corporation for Atmospheric Research.
Access and use of this software shall impose the following obligations
and understandings on the user. The user is granted the right, without
any fee or cost, to use, copy, modify, alter, enhance and distribute
this software, and any derivative works thereof, and its supporting
documentation for any purpose whatsoever, provided that this entire
notice appears in all copies of the software, derivative works and
supporting documentation. Further, UCAR requests that the user credit
UCAR/Unidata in any publications that result from the use of this
software or in any product that includes this software. The names UCAR
and/or Unidata, however, may not be used in any advertising or publicity
to endorse or promote any products or commercial entity unless specific
written permission is obtained from UCAR/Unidata. The user also
understands that UCAR/Unidata is not obligated to provide the user with
any support, consulting, training or assistance of any kind with regard
to the use, operation and performance of this software nor to provide
the user with any updates, revisions, new versions or "bug fixes."
THIS SOFTWARE IS PROVIDED BY UCAR/UNIDATA "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL UCAR/UNIDATA BE LIABLE FOR ANY SPECIAL,
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
WITH THE ACCESS, USE OR PERFORMANCE OF THIS SOFTWARE.
</pre>
<p>Here is the new license:</p>
<pre>
Copyright 1998-2018 University Corporation for Atmospheric Research/Unidata
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
</pre>
https://www.unidata.ucar.edu/blogs/developer/entry/event-notification-for-thredds-serversEvent Notification for Thredds ServersDennis Heimbigner 2017-08-07T13:35:42-06:002017-08-07T13:35:42-06:00
<p><strong>Initial Draft</strong>: 2017-08-05 <br />
<strong>Last Revised</strong>: 2017-08-05 <br />
<strong>Author</strong>: Dennis Heimbigner, Unidata </p>
<h3>Table of Contents</h3>
<ul>
<li><a href="#Introduction">Introduction</a></li>
<li><a href="#Architect">Architecture</a></li>
<li><a href="#Requirements">Requirements</a></li>
<li><a href="#InitialEvents">Initial Event Set</a></li>
<li><a href="#TopicSpaceDesign">Topic Space Design</a></li>
<li><a href="#Implementation">Implementation</a></li>
<li><a href="#LDM">Relation to LDM</a></li>
<li><a href="#Performance">Performance Costs</a></li>
<li><a href="#Security">Security</a></li>
<li><a href="#Summary">Summary</a></li>
<li><a href="#AppendixA">Appendix A. Miscellaneous Notes</a></li>
</ul>
<h2><a name="Introduction">Introduction</a></h2>
<p>Periodically some of the Thredds servers run by Unidata get
seriously overloaded. One cause is external users polling
the Thredds server to see what has changed. If the polling
rate is too high, the performance of the Thredds server can
seriously deteriorate.</p>
<p>I am proposing here to mitigate this problem by allowing
Thredds servers to generate events that signal changes that
might be of interest to users. Then, instead of polling,
these users can watch for specific change events and use
that information to update their local databases (or whatever).</p>
<p>The cost tradeoff for Unidata is the cost of periodic
"hammering" versus the maintenance of an event server
to distribute change events to users.</p>
<p>Looking ahead, it is also possible that this proposal can
facilitate inter-server communications. This means that multiple
Thredds servers could communicate useful information. This is
speculative for now, but should be kept in mind.</p>
<h2><a name="Architect">Architecture</a></h2>
<p>I am proposing a pretty standard publish-subscribe
system for use by a Thredds server. In this architecture,
there are hooks in various places in the Thredds code
that send short messages to a separate "broker" server.</p>
<p>On the client (user) side, each client registers with the broker
to tell it the kinds of messages in which it is interested.</p>
<p>So the flow is:</p>
<ol>
<li>The server generates a change message</li>
<li>The message is received by the broker</li>
<li>The broker forwards the message to all
clients that are registered as interested
in that kind of message.</li>
</ol>
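The three-step flow above can be sketched with a toy in-memory broker (illustrative only; a real deployment would use a standalone broker such as ActiveMQ, with clients connecting over the network):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy broker: clients register a callback per topic; the server publishes
// a message to a topic and the broker forwards it to all matching clients.
public class ToyBroker {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // Step 0: a client tells the broker which topic it is interested in.
    public void subscribe(String topic, Consumer<String> client) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(client);
    }

    // Steps 1-3: the server generates a message; the broker receives it
    // and forwards it to every client registered on that topic.
    public void publish(String topic, String message) {
        for (Consumer<String> c : subscribers.getOrDefault(topic, List.of()))
            c.accept(message);
    }
}
```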
<h2><a name="Requirements">Requirements</a></h2>
<p>In order to be useful to Unidata and its community,
I require certain capabilities for the event system.</p>
<h3>Topic-based Messages</h3>
<p>In event systems, there are typically two ways to identify
messages: by queue and by topic.</p>
<p>A topic based message is one that has an associated structured string
used to classify the message. Often, the structure of the string
is a tree represented by the format <em>field.field.field...</em>
where each field is some identifier. This format can be used, for example,
to mark the message as referring to some file in a tree structured file system.
Thus a file <em>/f1/f2/f3</em> might be mapped to the topic <em>f1.f2.f3</em>.</p>
<p>A queue-based identification is one in which a message is sent to
a specific named queue. It is isomorphic to a topic system as far
as a sender is concerned because each distinct topic string can
be the name of a queue. I will not consider queue-based systems further.</p>
<h3>Topic Wildcards</h3>
<p>On the client side, the client must be able to register for
messages by specifying a pattern indicating the message topics
in which it is interested. It is desirable to allow a client
to register for a number of different topics by specifying a pattern
containing wildcards (as is common in e.g. Unix file specifications).</p>
<p>If, for example, our client was interested in events about all files
within <em>/f1/f2</em>, it should be possible to specify a topic pattern
such as "f1.f2.*".</p>
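A sketch of the path-to-topic mapping and a single-field wildcard match (helper names are mine; a production system would follow an established convention such as the JMS or MQTT wildcard rules):

```java
import java.util.regex.Pattern;

public class Topics {
    // Map a file path like /f1/f2/f3 to a topic like f1.f2.f3.
    public static String pathToTopic(String path) {
        return path.replaceAll("^/+", "").replace('/', '.');
    }

    // Match a topic against a pattern where '*' matches exactly one
    // dot-free field; everything else is matched literally.
    public static boolean matches(String pattern, String topic) {
        String regex = Pattern.quote(pattern)
                .replace("*", "\\E[^.]+\\Q"); // '*' => one field
        return topic.matches(regex);
    }
}
```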
<h3>Durability</h3>
<p>Suppose a client is not active or not registered with the broker
at the time an event is received by the broker.
If later the client registers, it will not see that
previously generated message. This is a problem because
a client will be forced to again access the Thredds
server to see what happened while it was offline.</p>
<p>To deal with this problem, I require that our broker support
"durable" messages. In the event community, this means that
the broker will cache messages for some period of time. When a client
registers, it will receive any cached messages that match its pattern.
Supporting durability is tricky because of issues such as message
retention policies and message duplication. Nonetheless, this is
an essential requirement in order to avoid polling as much as possible.</p>
<h3>Persistence (Optional)</h3>
<p>In this context, persistence means that cached messages are maintained even
if the broker crashes. It is closely related to durability if the
durable messages are stored in some kind of file-based database system.
I do not require persistence, although it may come for free with
some brokers.</p>
<h3>Transactions (Not Required)</h3>
<p>Transactions in event systems are similar to those of database systems.
It essentially means that a message is guaranteed to be delivered
or it appears as if it is not delivered at all. It is unlikely that
transactions will be required for our event system.</p>
<h2><a name="InitialEvents">Initial Event Set</a></h2>
<p>The initial set of events of interest will be catalog
change events. Specifically:</p>
<ul>
<li>Insert - when a catalog has a new entry</li>
<li>Delete - when a catalog has an existing entry removed</li>
<li>Modify - when a catalog has an existing entry modified</li>
</ul>
<p>Modification is tricky since it is effectively detected by looking at
the modification date of a file, which may not reflect an actual
change. Note also that it may turn out that modifications will
actually look like a deletion followed by an insertion.</p>
<p>Note that modification false positives are generally acceptable
if not too frequent. This is because the cost to the client is
that it retrieves a catalog entry that has in fact not been modified.</p>
<p>False negatives are much less acceptable. This will occur if
an insertion, deletion, or modification occurs but no corresponding
event is generated. If a client cannot trust the server in this
regard, then the event system will not be used.</p>
<p>Note also that change events should also apply to the creation of new catalogs
as well as the files within a catalog.</p>
<h2><a name="TopicSpaceDesign">Topic Space Design</a></h2>
<p>Topic Space Design is an important aspect of this proposal.
That is, we need to set up a topic tree such that
clients can specify what they want with a reasonable degree of
specificity.</p>
<p>From LDM, we know that this is important in order to avoid
clients just subscribing to everything. I would hope that
using a wildcard system -- as opposed to e.g.
regular expressions -- is sufficiently simple that clients
will not be tempted to ask for everything. Realistically,
this is probably a forlorn hope because it is likely
that pollers want to know everything about a specific server.</p>
<p>My initial thought is that the root of our topic space
is "Unidata.Thredds". From there, I would like to specify
a particular server via a DNS name + port. There is a problem
since DNS names contain dots. It may be necessary to use an
encoding, URL %xx encoding for example, to escape the dots and
colons in the DNS name and port. So one might say
"Unidata.Thredds.motherlode%2eucar%2eedu%3a8080".</p>
<p>From there, the obvious choice is to encode the catalog path
as the rest of the topic. So, for example:</p>
<pre>
Unidata.Thredds.thredds%2eucar%2eedu%3a8080.catalog.grib.NCEP.GEFS.Global%2e1p0deg%2eEnsemble.members-analysis.GEFS%2eGlobal%2e1p0deg%2eEnsemble%2eana%2e20170723%2e0000.grib2.*
</pre>
<p>The character escaping issue needs some thought.</p>
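One possible encoding, escaping just the two characters that collide with the topic-field syntax (an assumed helper, not existing Thredds code):

```java
public class TopicEncode {
    // Escape '.' and ':' so a host:port can be embedded as a single
    // topic field without being split on the topic separator.
    public static String encodeField(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '.') out.append("%2e");
            else if (c == ':') out.append("%3a");
            else out.append(c);
        }
        return out.toString();
    }
}
```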
<h2><a name="Implementation">Implementation</a></h2>
<p>Currently, Thredds does not immediately detect changes
to its underlying file system. Rather, it dynamically
rebuilds some catalogs when accessed. The dynamically
generated catalog will, of course, reflect changes to that
catalog since it was last accessed.</p>
<p>John Caron left nascent code (CatalogWatcher) in Thredds
to actually operate at the time a change occurred (as opposed
to when a catalog is retrieved). This is more or less the
eager vs lazy issue.</p>
<p>So, in order to make this proposal work, I need to do (at least)
the following:</p>
<ul>
<li>Complete the existing code pieces such as CatalogWatcher</li>
<li>Add code to convert a file system reference to a catalog system URL</li>
<li>Add an event sender to Thredds</li>
</ul>
<p>This appears to be a straightforward set of modifications.
Of course Murphy's law says that I am forgetting something.</p>
<h2><a name="LDM">Relation to LDM</a></h2>
<p>My current broker of choice is Apache ActiveMQ.
But I cannot help but notice that LDM functionality is related
to this proposal. So it is fair to ask whether LDM could be adapted
to serve as the broker. Here are some issues that would need to
be considered. Note that my knowledge of the current capabilities
of LDM may be out of date.</p>
<ol>
<li>Message Size: LDM ships files, not short messages. In this sense,
LDM is overkill.</li>
<li>Volume: my speculation is that the volume of small messages
would not be all that large; it might even be similar to
the volume of distinct files shipped by LDM.</li>
<li>Multiple Languages: whatever broker we use, it must be possible
to write clients in a variety of programming languages: C, Python,
Java at least.</li>
</ol>
<p>While it would be nice to make use of other Unidata technology,
I currently do not plan to pursue this path. As noted, my
current target is an ActiveMQ broker with JMS publishers and subscribers.</p>
<h2><a name="Performance">Performance Costs</a></h2>
<p>There is a performance cost in having the Thredds server generate
events. But it is difficult to assess these costs without knowing
how many events are being generated.</p>
<p>It should be possible to experimentally measure the number of events
by adding counters to the CatalogWatcher code and running it for
an extended period of time. Periodically, the counters can be dumped
to the logs.</p>
<p>Ideally we would have R x 3 x 24 x 31 counters so we could
compute the number of events per hour, day, week, and month per
catalog root (assuming R roots). This is not a large amount of
space. So at the end of every hour, day, and month, a log entry
would be generated with a terse listing of the counters. The
counters would then be reset.</p>
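The counting itself is simple; a sketch of per-root counters that are dumped and reset each period (illustrative, not actual CatalogWatcher code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Per-catalog-root event counters: record() would be called from the
// change-detection hooks, and dumpAndReset() at the end of each period
// before writing a terse listing to the logs.
public class EventCounters {
    private final ConcurrentHashMap<String, LongAdder> perRoot =
            new ConcurrentHashMap<>();

    public void record(String catalogRoot) {
        perRoot.computeIfAbsent(catalogRoot, r -> new LongAdder()).increment();
    }

    // Return the count for one root and reset it to zero.
    public long dumpAndReset(String catalogRoot) {
        LongAdder a = perRoot.remove(catalogRoot);
        return a == null ? 0 : a.sum();
    }
}
```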
<p>The big problem is that these counters need to be executed on one
of the Unidata production systems (aka motherlode) in order to get
realistic numbers. It is not clear if this is possible.</p>
<p>Note that the measurements would be taken over all the catalog
roots. It is possible that other parts of the file system such as
the GRIB indices would also need to be included.</p>
<h2><a name="Security">Security</a></h2>
<p>The question is: how open should our broker be to arbitrary
clients? The short answer is that it probably should have
the same access controls as the associated Thredds server.
Again, it is not clear what the requirements here should be.</p>
<h2><a name="Summary">Summary</a></h2>
<p>The original motivation for this proposal was to try to mitigate
clients hammering our production servers. The idea is that if
they have access to change events, they do not need to poll our
servers (or at least not quite so often).</p>
<p>Will it work? It depends on several factors:</p>
<ol>
<li>Is it easy for clients to use?</li>
<li>Are they willing to change?</li>
<li>Is it sufficiently reliable that clients will not feel
they are losing events?</li>
</ol>
<h2><a name="AppendixA">Appendix A. Miscellaneous Notes</a></h2>
<h3>Multiple Servers Per Broker</h3>
<p>The above discussion assumes that the server-broker
association is one-to-one. Other arrangements are possible, such
as multiple servers sharing a single broker or a server using
multiple brokers. The relative merits of these alternatives
are unclear to me, but the possibility is worth noting.</p>
<h3>Thredds Persistence</h3>
<p>I could implement durability in the Thredds server
by keeping a queue of changed directories
and allowing clients to ask for everything since &lt;some date&gt;.
In effect, I would be subsuming the broker as part of Thredds.
This is possible, and maybe an acceptable approach.</p>
https://www.unidata.ucar.edu/blogs/developer/entry/the-death-of-server-sideThe Death of Server-Side ComputingDennis Heimbigner 2017-06-05T12:25:45-06:002017-06-05T12:25:45-06:00<p>For a number of years, the Unidata Thredds group has been in the
process of "implementing" server-side computation
<em>Real-Soon-Now</em> (as the saying goes).</p>
<p>Events have overtaken the previous notion of server-side
computing and here we try to codify a replacement that uses a
separate server model based on Jupyter (an offshoot of IPython).</p>
<p>From the point of view of Unidata, Jupyter provides a powerful
alternative to roll-your-own server-side computing. It supports
multiple, "real" programming languages. It is a server itself, so
it can be co-located with an existing Thredds server. And, most
importantly, it is designed to execute small programs written in any
of its supported languages.</p>
<p>We are proposing to implement server-side computing for Thredds by using
one or more co-located Jupyter servers. This document elaborates on the
capabilities and required support infrastructure to make this
proposal operational.</p>
<h2>(OK, that is hyperbole, but...)</h2>
<!--# The Death of Server-Side Computing-->
<p><strong>Initial Draft</strong>: 2017-5-28 <br />
<strong>Last Revised</strong>: 2017-6-5 <br />
<strong>Author</strong>: Dennis Heimbigner, Unidata </p>
<h3>Table of Contents</h3>
<ul>
<li><a href="#Introduction">Introduction</a></li>
<li><a href="#Jupyter">The Alternative: Jupyter</a></li>
<li><a href="#Python">The Language: Python</a></li>
<li><a href="#Architecture">The Notional Architecture</a></li>
<li><a href="#Accessing">Accessing the Jupyter Server</a></li>
<li><a href="#Asynchronous">Asynchronous Operation</a></li>
<li><a href="#Value">Thredds Value Added</a></li>
<li><a href="#Specialization">Specialized Capabilities</a></li>
<li><a href="#Security">Access Controls</a></li>
<li><a href="#Charging">Resource Controls</a></li>
<li><a href="#Plan">Planned Activities</a></li>
<li><a href="#References">References</a></li>
</ul>
<h2><a name="Introduction">Introduction</a></h2>
<p>For a number of years, the Unidata Thredds group has been in the
process of "implementing" server-side computation
<em>Real-Soon-Now</em> (as the saying goes).</p>
<p>Server-side computing embodies the idea that it is most
efficient to physically co-locate a computation with the
datasets on which it is operating. As a rule, this meant having
a server execute the computation because the dataset was
controlled by that server. Server-side computing servers for
the atmospheric community have existed in various forms for a
while now: GRADS, DAP2 servers, and ADDE, for example.</p>
<p>One -- and perhaps <em>The</em> -- major stumbling block to server-side
computing is defining and implementing the programming language
in which the computation is coded. In practice, server-side
systems have developed their own language for this purpose.
This is a problem primarily because it is very difficult to
define and implement a programming language. Often the
"language" started out as some form of constraint expression
(e.g. DAP2, DAP4, and ADDE). Over time, it would accrete other
capabilities: conditionals, loops, etc. In time, it grew into a
badly designed but more complete programming language. Since it
was rarely implemented by language/compiler experts, it usually
was quirky and presented a significant learning curve for users.</p>
<p>The advantage to using such a home grown language was that it
could be tailored to the dataset models supported by the
server. It also allowed for detailed control of programs. This
made certain other issues easier: access controls and resource
controls, for example.</p>
<p>The author recognized the language problem early on and was
reluctant to go down that path. Since he was the primary "pusher" for
server-side computing at Unidata, that reluctance delayed implementation
for an extended period.</p>
<h2><a name="Jupyter">The Alternative: Jupyter</a></h2>
<p>Fortunately, about three years ago, project Jupyter [1]
was created as an offshoot of the IPython Notebook
system. It provided a multi-user, multi-language compute engine
in which small programs could be executed. With the advent
of Jupyter, IPython then refactored its computation part to
use Jupyter.</p>
<p>From the point of view of Unidata, Jupyter provides a powerful
alternative to traditional server-side computing. It supports
multiple, "real" programming languages. It is a server itself, so
it can be co-located with an existing Thredds server. And, most
importantly, it is designed to execute small programs written in any
of its supported languages.</p>
<p>In the rest of this document, the term "program" will, as a
rule, refer to programs executing within a Jupyter server.</p>
<h2><a name="Python">The Language: Python</a></h2>
<p>In order to avoid the roll-your-own language problem, it was decided
to adopt wholesale an existing modern programming language. This meant
that the language was likely to be complete right from the start. Further,
the learning curve would be reduced because a significant amount
of supporting documentation and tutorials would be available.</p>
<p>We have chosen Python as our preferred language. We made this
choice for several reasons.</p>
<ol>
<li>Python is rapidly being adopted by the atmospheric sciences
community as its language of choice.</li>
<li>There is a very active community that is
developing packages for use by the scientific community
and more specifically for the atmospheric sciences community.
Examples are numerous, including numpy, scipy, metpy, and siphon.</li>
<li>It is one of the languages supported by Jupyter.</li>
</ol>
<p>To the extent that Jupyter supports other languages, it would be
possible to write programs in those languages. However, I would not expect
Unidata to expend any significant resources on those other languages.
The one possible exception is if/when Jupyter supports Java.</p>
<h2><a name="Architecture">The Notional Architecture</a></h2>
<p><img src="https://github.com/DennisHeimbigner/Images/raw/master/Jupyter%2BThredds%2BServlet.png" alt="Notional Architecture" style="width: 300px;"/></p>
<p>The notional architecture we now espouse is shown in Figure 1.
Basically, a standard Thredds server runs along side a
Jupyter server. A program executing in the Jupyter server has
access to the data on the Thredds server either using the file
system or using some streaming protocol (e.g. DAP2). File access
is predicated on the assumption that the two servers are
co-located and share a common file system.</p>
<p>The Thredds server currently requires and uses some form of
servlet engine (e.g. Tomcat). We exploit that to provide a
front-end servlet to act as intermediary between a user and the
Jupyter server (see <a href="#Accessing">below</a>).</p>
<p>So now, instead of sending a program to the Thredds server, it
is sent to the Jupyter server for execution. That executing
program is given access to the Thredds server using a variety of
packages (e.g. Siphon [2]). Once its computation is completed,
its resulting products can be published within a catalog on
Thredds to make them accessible to user programs.
Once in the catalog, a product can be accessed by external
clients using existing streaming protocol services. In some cases,
it may also be possible to access that product
using a shared file system.</p>
<p>This discussion assumes the existence of a single Jupyter server,
but it will often be desirable to allow multiple such servers.
Examples of the utility of multiple servers will be discussed
in subsequent sections.</p>
<h2><a name="Accessing">Accessing the Jupyter Server</a></h2>
<p>Access to the Jupyter server will be supported using several
mechanisms. Each mechanism has a specific use case.</p>
<h3>IPython Access</h3>
<p>Though not shown in Figure 1, it is assumed that existing
IPython access to Jupyter is available. This path is, of course,
well documented elsewhere in the IPython+Jupyter literature.</p>
<h3>Web-based Access</h3>
<p>Another use-case is to provide access for scientists with
limited programming skills or for other users requiring
simple and occasional computations.</p>
<p>The servlet box in Figure 1 illustrates this case:
client web browsers carry out forms-based
computations via the front-end servlet running under Apache
Tomcat (or another servlet engine).</p>
<h3>Programmatic Access</h3>
<p>Scientists will still write standalone programs that need to
process computed data. Others will write value-added wrapper
programs to provide, for example, additional capabilities such
as plotting or other graphical presentation.</p>
<p>These use cases will require the ability to upload and execute
programs from client-side programs. The simplest approach here
is to build on the web-based version. That is, the client side
program would also access the servlet, but using a modified
and streamlined interface.</p>
<h2><a name="Asynchronous">Asynchronous Operation</a></h2>
<p>Some computations will take a significant amount of time to
complete. Submitting such a computation through the Thredds
server interface is undesirable because it requires
either blocking of the client for long periods of time or
complicating the Thredds server to make it support
<em>asynchronous</em> execution. The latter usually involves
returning some kind of token (aka <em>future</em>) to the client
that it can interrogate to see if the computation is
complete. Or alternatively, providing some form of server to
client event notification mechanism. In any case, such mechanisms
are complicated to implement.</p>
<p>Direct client to Jupyter communication (see previous <a href="#Accessing">section</a>)
can provide a simple and effective alternative to direct
implementation of asynchronous operation. Specifically, the client
uploads the program via IPython or via a web browser to the
Jupyter server. As part of its operation the program
uploads its final product(s) to some catalog in the Thredds server.
The client is then responsible for detecting that the product
has been uploaded, which then enables further processing of that product
as needed.</p>
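The client-side responsibility described above can be sketched as a small polling helper. The <code>product_available</code> predicate, standing in for "has my product appeared in the Thredds catalog yet", is an assumption of this sketch, as are the parameter names.

```python
import time

def wait_for_product(product_available, timeout=600.0, interval=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until product_available() reports True or the timeout elapses.

    product_available: zero-argument callable, e.g. a check that a dataset
    name now appears in a Thredds catalog (hypothetical in this sketch).
    Returns True if the product appeared, False on timeout.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if product_available():
            return True
        sleep(interval)
    return False
```

Injecting the clock and sleep functions keeps the helper testable; a real client would pass a catalog-lookup closure as the predicate.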
<h2><a name="Value">Thredds Value Added</a></h2>
<p>Given the approach advocated in this document,
on what should Unidata focus to support it?</p>
<h3>Accessing Thredds Data</h3>
<p>First and foremost, we want to make it easy, efficient, and fast
for programs to access the data within a co-located Thredds server.</p>
<p>Thredds currently provides a significant number of "services" [3]
through which metadata and data can be extracted from a Thredds server.
These include at least the following: DAP2 (OpenDAP), DAP4, HTTPServer,
WCS, WMS, NetcdfSubset, CdmRemote, CdmrFeature, ISO, NcML, and UDDC.</p>
<p>The cost of accessing data via commonly supported
protocols, such as DAP2 or CdmRemote, is relatively independent of
co-location, so using such protocols is probably not the most
efficient method.</p>
<h3>File Download</h3>
<p>The most efficient inter-server communication is via a shared file
system accessible both to the Thredds server and the
Jupyter server.</p>
<p>As of Thredds 5 it is possible to materialize both datasets and
(some kinds of) streams as files: typically netcdf-3 (classic)
or netcdf-4 (enhanced). One defines a directory into which
downloads are stored. A special kind of request is made to a
Thredds server that causes the result of the query to be
materialized in the specified directory. The name of the
materialized file is then returned to the client.</p>
<h3>Siphon</h3>
<p>The Siphon project [2,4] is designed to wrap access to a Thredds
server using a variety of Thredds services. As such, it will
feature prominently in our system. Currently, Siphon supports
the reading of catalogs, and data access using the Thredds
netcdf subset service (NCSS), CdmRemote, and radar data.</p>
<h3>Operators</h3>
<p>The raison d'être of server-side computation is to input
datasets, apply operators to them, and produce new product
datasets. In order to simplify this process, it is desirable
to make available many high-level operators so that
a computation can be completed by the composition of operators.</p>
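Composition of operators can be illustrated with plain functions. The two toy operators below (a time mean and an anomaly) are invented for illustration and are not part of any Unidata library; real operators would work on netCDF variables rather than nested lists.

```python
from functools import reduce

def time_mean(series):
    """Average over the leading (time) dimension of a list-of-lists grid."""
    nsteps = len(series)
    return [sum(vals) / nsteps for vals in zip(*series)]

def anomaly(series):
    """Subtract the time mean from every time step."""
    mean = time_mean(series)
    return [[v - m for v, m in zip(step, mean)] for step in series]

def compose(*operators):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), operators)
```

A computation is then just a pipeline, e.g. <code>compose(time_mean, anomaly)</code>, which is the property that makes a vetted library of high-level operators so valuable.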
<p>Often, server-side computation is illustrated using simple
operations such as sum and average. But these kinds of operators
are likely to only have marginal utility; they may be useful,
but will not be the operators doing the heavy lifting of server
side computation.</p>
<p>Accumulating useful operators is possibly another place where
Unidata can provide added value. Unidata can both provide a
common point of access, as well as providing some form of vetting
for these operators.</p>
<p>One example is Pynco [5]. This is a Python wrapping of
the netCDF Operators (NCO) [6]. NCO is currently all command line,
so Pynco wraps it to allow programmatic invocation of the various
operators.</p>
<p>As part of the operator support, Unidata might wish
to create a repository (using conda channels or Github?) to
which others can contribute.</p>
<h3>Publication (File Upload)</h3>
<p>When a program is executed within Jupyter,
it will produce results that need to be communicated to others --
especially the client originating the computation.
The obvious way to do this is to use the existing Thredds
<em>publication</em> facilities, namely catalogs.</p>
<p>As of Thredds 5, it is possible to add a directory to some top-level
catalog. Uploading a file into that directory causes it to appear
in the specified catalog. Uploading can be accomplished either
by file system operations or via a browser forms page.</p>
<h2><a name="Specialization">Specialized Capabilities</a></h2>
<p>Another way to add value is to make libraries available
that support specialized kinds of computations.</p>
<h3>GPU Support</h3>
<p>The power of Graphics Processing Units (GPUs) has significantly
increased over the last few years. Libraries now exist for
performing computations on GPUs. To date, using a GPU on
atmospheric data is uncommon. It should be possible to improve
the situation by making operators available that use a GPU
underneath to carry out the computation.</p>
<h3>Machine Learning Support</h3>
<p>Artificial Intelligence, at least in the form of machine learning,
is another example of a specialized capability. Again,
use of AI to process atmospheric data is currently not common.
It should be possible to build quite sophisticated subsystems
supporting the construction of AI systems for doing predictions
and analyses on such data.</p>
<h2><a name="Security">Access Controls</a></h2>
<p>There is a clear danger in providing a Jupyter server
open to anyone to use. Such a server is a potential
exploitable security hole if it allows the execution of arbitrary
code. Further, there are resource issues when anyone is allowed
to execute a program on the server.</p>
<p>Much of the support for access controls will depend on the
evolving capabilities implemented by the Jupyter project. But
we can identify a number of access controls that will be needed
to protect a Jupyter server.</p>
<h3>Sandboxing</h3>
<p>The most difficult problem is to prevent the execution of
arbitrary code on the Jupyter server. Effectively, such code
must be <em>sandboxed</em> to control what facilities are made
available to executing programs. Two sub-issues arise.</p>
<ol>
<li><p>Some Python packages must be suppressed. Arbitrary file operations
and sub-process execution are two primary points of concern.</p></li>
<li><p>Various packages must be usable: numpy, metpy, siphon, for example.
They are essential for producing the desired computational products.
However, the security of the system depends on the security of those
packages. If they contain exploitable security flaws, then security as a
whole is compromised.</p></li>
</ol>
<h3>Authentication</h3>
<p>Strong authentication mechanisms will need to be implemented
so that only authorized users can utilize the resources of a Jupyter
server.
Jupyter authentication may need to be coordinated with
the Thredds server so that some programs executed on Jupyter
can have access to otherwise protected datasets on the Thredds server.
This is one case where multiple Jupyter servers (and even multiple
Thredds servers) may be needed to support specialized access
to controlled datasets by using isolated Jupyter servers.</p>
<h2><a name="Charging">Resource Control Mechanisms</a></h2>
<p>Uncontrolled execution of code can potentially be a significant
performance problem. Additionally, it can result in significant
costs (in dollars) being charged to the server's owner.</p>
<p>For many situations, it will be desirable to force clients to
stand up their own Jupyter server co-located with some Thredds
server in a way that allows the client to pay for the cost of
the Jupyter server. Cloud computing is the obvious
approach. Clients will pay for their own virtual machine that is
as "close" to the Thredds server as their cloud system will
allow. The client can then use their own Jupyter server on
their own virtual machine to do the necessary computations and
for which they will be charged.</p>
<h2><a name="Plan">Planned Activities</a></h2>
<p>A preliminary demonstration of communication between Thredds and Jupyter
was created by Ryan May under the auspices of the ODSIP grant [7]
funded from the NSF Earthcube program.</p>
<p>We anticipate starting from the ODSIP base demonstration and
extending it over time. Subject to revision, the current plan
involves the following steps.</p>
<h4>Step 1. Servlet-Based Access</h4>
<p>The first step is to build the servlet front-end.
This may be viewed as a stripped-down mimic of IPython.
This servlet will support both forms-based access as well as
programmatic access.</p>
<h4>Step 2. Operators</h4>
<p>An initial set of operator libraries will need to be
collected so that testing, experimentation, and tutoring
can proceed. This will be an ongoing process. One can hope
that some form of repository can be established and that
a critical mass of operators will begin to form.</p>
<h4>Step 3. Configuration</h4>
<p>The next step is to make it possible, though not necessarily easy,
for others to stand up their own Jupyter + Thredds. One approach
would be to create a set of Docker instructions for this purpose.
This would allow others to directly instantiate the Docker container
as well as provide a recipe for non-Docker operation.</p>
<h4>Step 4. Examples</h4>
<p>As with any new system,
external users will have difficulties in using it.
So a variety of meaningful examples will need to be
created to allow at least cutting-edge users to begin
to experiment with the system. Again, this will be
an on-going activity.</p>
<h4>Step 5. Access Controls</h4>
<p>At least an initial access control regime cannot be delayed for very long.
Some external users can live without this in the short term.
But for more widespread use, the users must have some belief in the
security of the systems that they create. As with operators, this will
be an on-going process.</p>
<h4>Step 6. Workshops and Tutorials</h4>
<p>At some point, this approach must be presented to the larger community.
For Unidata, this is usually done using our Workshops. Additionally,
video tutorials and presentations will need to be created.</p>
<h2><a name="References">References</a></h2>
<p>[1] https://en.wikipedia.org/wiki/IPython#Project_Jupyter <br />
[2] https://github.com/Unidata/siphon <br />
[3] https://www.unidata.ucar.edu/software/thredds/v4.3/tds/tds4.3/reference/Services.html <br />
[4] https://unidata.github.io/siphon/ <br />
[5] https://github.com/nco/pynco <br />
[6] http://nco.sourceforge.net/ <br />
[7] https://www.nsf.gov/awardsearch/showAward?AWD_ID=1343761</p>
https://www.unidata.ucar.edu/blogs/developer/entry/proposed-thredds-architecture-changes-forProposed Thredds Architecture Changes for OSGI/JigSawDennis Heimbigner 2017-05-25T15:44:52-06:002017-05-31T17:55:24-06:00<p>This post provides some preliminary ideas on the consequences of moving TDS to use OSGI or JigSaw.</p>
<p>Assumptions:</p>
<ol>
<li>OSGI and Jigsaw will be sufficiently similar so that this proposal will work with either, with some tweaks.</li>
<li>Initial target is Thredds server</li>
<li>We will want to dynamically load at least the following kinds of things on the server.
<ul><li>IOSPs (e.g. netcdf4, grib, etc.)</li>
<li>RAFs (e.g. S3 and HDFS)</li>
<li>Services (e.g. DAP4)</li></ul>
I will refer to all of these generically as "bundles" (OSGI terminology).</li>
</ol>
<p>The loading process could be either:</p>
<ol>
<li>lazy - load only when actually requested</li>
<li>eager - load at startup, extending a skeleton TDS into a specifically configured TDS.</li>
</ol>
<p>For the eager case, we can assume that some config file (e.g. ThreddsConfig.xml) contains the information needed to dynamically extend the tds to make various bundles available.</p>
<p>For the lazy case, it must be possible to create a "signal" that some bundle is needed and must be preloaded. I can see two obvious ways to do this.</p>
<ol>
<li>Stubs -- we provide stub classes for all the bundles so that calling the stub API the first time causes the bundle to be loaded and then used from then on.</li>
<li>Explicit -- any user of a bundle must explicitly invoke some code to load the required bundle.</li>
</ol>
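The stub approach can be sketched in miniature as a proxy that defers loading the real bundle until first use. Python's <code>importlib</code> stands in here for an OSGI/JigSaw class loader; a real TDS stub would of course be Java, and the class name is invented for illustration.

```python
import importlib

class LazyBundle:
    """Stub standing in for a bundle: loads the real module on first use."""

    def __init__(self, module_name):
        self._name = module_name
        self._module = None  # not loaded until first attribute access

    def __getattr__(self, attr):
        # Called only for attributes not found on the stub itself,
        # i.e. calls into the bundle's API.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)
```

The first call through the stub triggers the load; every later call goes straight to the cached implementation, which is exactly the transparency the stub option promises.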
<p>My current inclination is to use the eager approach since it is simpler and still allows us to keep a small-footprint .war file.</p>
<p>Another question is: where are the bundles stored? I assume they are not kept in the .war file since that would defeat one of the purposes of using dynamic loading. I presume there would be some default repository(s) plus a configurable set of additional repositories from which bundles can be pulled. It may be that NEXUS is usable for this purpose.</p>
<p>A note on IOSPs. Currently the IOSP to use is determined by calling a method that looks at a RAF wrapping a file. This method decides if it can process the associated file. If we were to use lazy loading, it is probable that for IOSPs we would need to divide the IOSP into two parts: one for testing applicability and one for processing. This is an argument for using eager loading.</p>
https://www.unidata.ucar.edu/blogs/developer/entry/new-tds-cloud-architectures-proposalNew TDS Cloud Architectures: Proposal 1Dennis Heimbigner 2016-09-15T20:39:30-06:002016-09-30T16:29:11-06:00<p>The Thredds Data server (TDS) was designed to operate in a client-server architecture. Recently, Unidata has moved TDS into the cloud using its existing architecture.</p>
<p>There seems to be agreement inside Unidata that we need to begin rethinking that architecture to adapt to the realities of the cloud.</p>
<p>[First Draft: 9/15/2016]<br />
[Last updated: 9/16/2016]</p>
<h2>Proposal 1</h2>
<p>This (first) proposal makes an assumption about the nature of the cloud, especially as it is likely to be in the near future.</p>
<p>The assumption is that rather than having large quantities of data behind a (TDS) server, all data will be stored in cloud storage such as Amazon S3 or Azure blobs.</p>
<p>Secondarily, in such an environment, TDS cannot be aware of all data, because the set of all data is likely to be growing at a fast rate and by organizations not known to a given TDS server.</p>
<p>In this environment, the role of TDS becomes more that of a locator and transformer of data. That is, TDS must be made aware of some datasets; it then applies various computations to that data to produce new derived data, which it publishes into cloud storage.</p>
<p>Some consequences:</p>
<ul>
<li>Unidata may have to get into the data discovery business, something it has tended to avoid so far.</li>
<li>The new TDS must be organized so that others can extend its capabilities by providing new kinds of computation models.</li>
<li>It is not clear if protocols such as DAP2, DAP4, CdmRemote, etc. will be needed any longer, because clients will be able to access the computed products using the S3 or Blob interfaces. In effect, streaming is replaced by the reification of computations into a file in S3/Blob.</li>
<li>Asynchronous computations more or less fall out of this proposed architecture if it is possible for a client to poll S3/Blob for some dataset or to get an event notification from the cloud.</li>
<li>Standardized file formats now become more important than ever. The primary such formats for the atmospheric sciences are, I believe, netcdf-3 and netcdf-4. The HDF5 format is likely to become more important as well, although its complexity vis-a-vis netcdf-4 will, in my opinion, hold it back.</li>
</ul>
<p>Some questions:</p>
<ul>
<li>Is there room for another (or several) standard file formats?</li>
<li>Is it possible to define a wrapper API for S3, Azure blobs, and whatever Google and other cloud companies provide? Such an API would help clients avoid locking in to a single provider.</li>
<li>What is the relation between this proposal and, say, Amazon lambda, or microservices?</li>
</ul>
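On the wrapper-API question, a minimal provider-neutral interface might look like the sketch below. The three-method surface and the in-memory implementation are assumptions for illustration, not an actual S3/Azure binding; real backends would wrap boto3 or azure-storage-blob behind the same calls.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Hypothetical provider-neutral facade over S3, Azure blobs, etc."""

    @abstractmethod
    def put(self, key, data): ...
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def list(self, prefix=""): ...

class MemoryStore(ObjectStore):
    """In-memory stand-in, useful for tests and for sketching the API."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))
```

Client code written against <code>ObjectStore</code> would not care which cloud vendor sits underneath, which is precisely the lock-in protection the question asks about.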
<p>[9/16/2016]</p>
<h2>Notes on Services to be Provided</h2>
<h3>Catalog</h3>
<p>Our current catalog system assumes that there is some set of datasets over which we have control and knowledge. As a rule, that set is the set of datasets on the Thredds server machine.</p>
<p>Under this proposal, this becomes less true. There may be no such set. Let us propose instead that we provide an umbrella catalog to which others can ask to have their datasets added. Additionally, others might ask to have their catalogs grafted onto our catalog tree. In any case, we are effectively talking about a federated catalog.</p>
<p>The value added is that we become the place to go to locate datasets. A consequence is that it becomes incumbent on us to:</p>
<ol>
<li>Make searching our catalogs easy and support sophisticated searches.</li>
<li>Provide our catalog in a variety of formats, such as in the form of a set of relational tables.</li>
<li>Provide the ability to crack datasets to obtain additional information for our catalogs.</li>
</ol>
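The grafting idea can be sketched with nested dictionaries standing in for catalog trees; the structure and function names are illustrative, not the actual Thredds catalog schema.

```python
def graft(umbrella, path, subcatalog):
    """Attach a contributed subcatalog under the given slash-separated
    path in the umbrella catalog (both plain nested dicts here)."""
    node = umbrella
    parts = path.strip("/").split("/")
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = subcatalog
    return umbrella

def find(catalog, path):
    """Look up a dataset or subcatalog by slash-separated path."""
    node = catalog
    for part in path.strip("/").split("/"):
        node = node[part]
    return node
```

The federation point is that the umbrella owner never copies the contributed datasets; it only holds the graft points, so search and discovery can span catalogs maintained by other organizations.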
<h3>CDM</h3>
<p>We also need to think about the role of CDM in this proposal. Currently, CDM is our UNCOL (a historical reference) in that CDM is the common model that allows us to separate the dataset format from the users of that dataset. That is, an IOSP maps some data format to CDM, and then tools can be defined in terms of CDM to avoid having to know about all the actual data formats. This is a very powerful approach and we should not discard it.</p>
<h3>Subset Services</h3>
<p>Data subsetting services, in the form of NCSS and the DAP2/DAP4 constraint languages, are an additional service we provide that will continue to be important in any new architecture. In fact, I think that pulling them out as a separate set of services would fit well with this architecture. [Needs more thought.]</p>
<p>[More thoughts will be added as they occur to me]</p>
https://www.unidata.ucar.edu/blogs/developer/entry/upload-and-download-support-forUpload and Download Support for TDSDennis Heimbigner 2016-08-23T19:57:54-06:002016-08-31T11:33:46-06:00<p>For version 5.0.0, it is possible to configure TDS to support the uploading and downloading of files into the local file system using the "/thredds/download" url path. This is primarily intended to support local File materialization for server-side computing. The idea is that a component such as <a href="http://jupyter.org">Jupyter</a> can materialize files from TDS to make them available to code being run in Jupyter. Additionally, any final output from the code execution can be uploaded to a specific location in the TDS catalog to make it available externally.</p>
<p>Note that this functionality is not strictly necessary since it could all be done on the client side independent of TDS. It is, however, useful because the client does not need to duplicate code already available on the TDS server. This means that this service provides the following benefits to the client.</p>
<ol>
<li>It is lightweight WRT the client</li>
<li>It is language independent</li>
</ol>
<h2>Assumptions</h2>
<p>The essential assumption for this service is that any external code using this service is running on the same machine as the Thredds server, or at least shares a common file system, so that file system operations by Thredds are visible to the external code.</p>
<p>An additional assumption is that "nested" calls to the Thredds server will not cause a deadlock. This is how access to non-file datasets (e.g. via DAP2 or DAP4 or GRIB or NCML) is accomplished. That is, the download code on the server will do a nested call to the server to obtain the output of the request. Experimentation shows this is not currently a problem.</p>
<h2>Supported File Formats</h2>
<p>Currently the download service supports the creation of files in two formats:</p>
<ol>
<li>Netcdf classic (aka netcdf-3)</li>
<li>Netcdf enhanced (aka netcdf-4)</li>
</ol>
<h2>Download Service Protocol</h2>
<p>A set of query parameters control the operation of this service. Note that all of the query parameter values (but not keys) are assumed to be url-encoded (%xx), so beware. Also, all return values are url-encoded.</p>
<h3>Request and Reply</h3>
<p>Invoking this service is accomplished using a URL pattern like this.</p>
<pre><code>http://host:port/thredds/download/?key=value&key=value&...
</code></pre>
<p>In all cases, the reply value for the invocation will be of this form.</p>
<pre><code>key=value&key=value&...
</code></pre>
<p>The specific keys depend on the invocation.</p>
<h3>Defined Requests</h3>
<p>The primary key is <strong>request</strong>. It indicates what action
is requested of the server.</p>
<p>The set of defined values for the <strong>request</strong> key are as follows.</p>
<ul>
<li><strong>download</strong></li>
<li><strong>inquire</strong></li>
</ul>
<h4>Request Keys Specific to "request=download"</h4>
<ul>
<li><p><strong>format</strong> -- This specifies the format for the returned dataset; two values are currently defined: <strong>netcdf3</strong> and <strong>netcdf4</strong>.</p></li>
<li><p><strong>url</strong> -- This is a thredds server url specifying the actual dataset to be downloaded.</p></li>
<li><p><strong>target</strong> -- This specifies the relative path for the downloaded file. If the file already exists, it will be overwritten. Any leading directories will be created underneath <strong>downloaddir</strong> (see below).</p></li>
</ul>
<h4>Reply Keys Specific to "request=download"</h4>
<ul>
<li><strong>download</strong> -- The absolute path of the downloaded file. In all cases, it will be under the <strong>downloaddir</strong> directory.</li>
</ul>
<h4>Request Keys Specific to "request=inquire"</h4>
<ul>
<li><strong>inquire</strong> -- This specifies a semi-colon separated list of keys whose value is desired. Currently, the only defined key is <strong>downloaddir</strong>, which returns the absolute path of the download directory. All downloaded files will be placed under this directory.</li>
</ul>
<h4>Reply Keys Specific to "request=inquire"</h4>
<ul>
<li><strong>downloaddir</strong> -- The absolute path of the directory under which all downloaded files are placed.</li>
</ul>
<h2>Upload Service Protocol</h2>
<p>File upload is not handled directly by calling the THREDDS server. Rather, it is handled by creating a directory that the THREDDS server scans and makes available at a specific point in the standard catalog.</p>
<h2>Thredds Server Configuration</h2>
<p>In order to activate upload and/or download, one or both of the following Java -D flags must be provided to the THREDDS server.</p>
<ul>
<li><strong>-Dtds.download.dir</strong> -- Specify the absolute path of a directory into which files will be downloaded.</li>
<li><strong>-Dtds.upload.dir</strong> -- Specify the absolute path of a directory into which files may be uploaded.</li>
</ul>
<p>Security concerns (see below) must be addressed when setting the permission on these directories.</p>
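<p>On the server side, the value of a -D flag such as <strong>tds.download.dir</strong> is available via <code>System.getProperty</code>, and the client-supplied relative target must be resolved beneath it without escaping. A minimal sketch of that resolution, under the assumption that normalization plus a prefix check is sufficient (the <code>DownloadDirResolver</code> class is hypothetical, not the actual TDS implementation):</p>

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DownloadDirResolver {
    /**
     * Resolve a client-supplied relative target beneath the configured download
     * directory (e.g. the value of System.getProperty("tds.download.dir")),
     * rejecting any path that would escape it after normalization.
     */
    public static Path resolveTarget(String downloadDir, String target) {
        Path base = Paths.get(downloadDir).toAbsolutePath().normalize();
        Path resolved = base.resolve(target).normalize();
        if (!resolved.startsWith(base)) {
            throw new IllegalArgumentException("target escapes the download directory: " + target);
        }
        return resolved;
    }
}
```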
<p>In order to complete the establishment of an upload directory, the following entry must be added to the <strong>catalog.xml</strong> file for the Thredds server.</p>
<pre><code><datasetScan name="Uploaded Files" ID="upload" location="${tds.upload.dir}" path="upload/">
<metadata inherited="true">
<serviceName>all</serviceName>
<dataType>Station</dataType>
</metadata>
</datasetScan>
</code></pre>
<p>Optionally, if one wants to make the download directory visible, the following
can be added to the same file.</p>
<pre><code><datasetScan name="Downloaded Files" ID="download" location="${tds.download.dir}" path="download/">
<metadata inherited="true">
<serviceName>all</serviceName>
<dataType>Station</dataType>
</metadata>
</datasetScan>
</code></pre>
<h2>Security Issues</h2>
<p>It should be clear that providing upload and download capabilities can introduce security concerns.</p>
<p>The primary issue is that this service causes the THREDDS server to write into user-specified locations in the file system. In order to prevent malicious writing of files, the download directory (specified by tds.download.dir) should be created in a safe place. Typically, this means it should be placed under a directory such as "/tmp" on Linux or an equivalent location on other operating systems.</p>
<p>This directory will be read and written by the user running the THREDDS server, typically "tomcat". The best practice is to create a dedicated user and group and set the download directory's owner and group to those values. The POSIX permissions for that directory should then be "rwxrwx---". Finally, the user "tomcat" should be added to the created group.</p>
<p>Corresponding concerns apply to the upload directory, so its owner, group, and permissions should be set in the same way as for the download directory.</p>
<p>The URL used to specify the dataset to be downloaded also raises security concerns. The URL is tested for two specific patterns to ensure proper behavior.</p>
<ol>
<li>The pattern ".." is disallowed in order to prevent attempts to escape the THREDDS sandbox.</li>
<li>The pattern "/download/" is disallowed in order to prevent an access loop in which a download call attempts to invoke download again.</li>
</ol>
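<p>Both of these checks amount to simple substring tests on the client-supplied URL. A minimal sketch (the <code>DownloadUrlCheck</code> helper is hypothetical, not the actual TDS code):</p>

```java
public class DownloadUrlCheck {
    /** Apply the two pattern checks described above to a client-supplied dataset URL. */
    public static boolean isAllowed(String url) {
        if (url.contains("..")) return false;         // would escape the THREDDS sandbox
        if (url.contains("/download/")) return false; // would loop back into this service
        return true;
    }
}
```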
<p>In order to provide additional sandboxing, the URL provided by the client is modified to ignore the host, port, and servlet prefix. They are replaced with the "&lt;host&gt;:&lt;port&gt;/thredds" of the THREDDS server itself. This prevents attempts to use the THREDDS server to access external data sources, which would otherwise be a security hole.</p>
<p>Finally, it is desirable that some additional access controls be applied. Specifically, Tomcat should be configured to require client-side certificates so that all clients using this service must have access to that certificate.</p>
<h2>Examples</h2>
<h3>Example 1: Download a file (via fileServer protocol)</h3>
<p>request:</p>
<pre><code>http://localhost:8081/thredds/download/?request=download&format=netcdf3&target=nc3/testData.nc3&url=http://host:80/thredds/fileServer/localContent/testData.nc&testinfo=testdirs=d:/git/download/tds/src/test/resources/thredds/server/download/testfiles
</code></pre>
<p>reply:</p>
<pre><code>download=c:/Temp/download/nc3/testData.nc3
</code></pre>
<p>Note: the encoded version of the request:</p>
<pre><code>http://localhost:8081/thredds/download/?request=download&format=netcdf3&target=nc3%2FtestData.nc3&url=http%3A%2F%2Fhost%3A80%2Fthredds%2FfileServer%2FlocalContent%2FtestData.nc&testinfo=testdirs%3Dd%3A%2Fgit%2Fdownload%2Ftds%2Fsrc%2Ftest%2Fresources%2Fthredds%2Fserver%2Fdownload%2Ftestfiles
</code></pre>
<h3>Example 2: Download a DAP2 request as a NetCDF-3 File</h3>
<p>request:</p>
<pre><code>http://localhost:8081/thredds/download/?request=download&format=netcdf3&target=testData.nc3&url=http://host:80/thredds/dodsC/localContent/testData.nc&testinfo=testdirs=d:/git/download/tds/src/test/resources/thredds/server/download/testfiles
</code></pre>
<p>reply:</p>
<pre><code>download=c:/Temp/download/testData.nc3
</code></pre>
<h2><a href="https://www.unidata.ucar.edu/blogs/developer/entry/thredds_and_java_8_plans">THREDDS and Java 8 plans</a></h2>
<p class="byline">
by Sean Arms
<br />2015-05-26
</p>
Netcdf-Java and the TDS version 4.6.1 have been released. This version requires Java 7+. Bug fixes and minor enhancements will continue on the 4.6 branch for six months or so.
<br />
<br />
Development is now switching to version 5.0 which will require Java 8. Version 5 is a major upgrade and some of the APIs will change. Deprecated classes will be moved to a legacy jar and will not be supported. If you are a developer, you will need to test the new version against your code. We expect to have an alpha release out by July for that purpose.
<br />
<br />
Java 7 had its <a href="https://www.java.com/en/download/faq/java_7.xml">final</a> release last month, and is at End of Life (EOL), so security fixes will no longer be applied and pushed to users. If you are running a public server, you must upgrade to <a href="http://www.oracle.com/technetwork/java/javase/downloads/index.html">Java 8</a>. Talk to your sysadmin about getting Java 8 installed on production machines. Educate your security team about this issue if it's not on their radar. Do it now, before it's an emergency.
<br />
<br />
On the desktop, we also recommend that you upgrade to Java 8 now. All known backwards-compatibility issues with THREDDS have been resolved (but if you run into any, please let us know).
<br />
<br />
The THREDDS Team