
05 June 2017

Initial Draft: 2017-5-28
Last Revised: 2017-6-5
Author: Dennis Heimbigner, Unidata

Introduction
The Alternative: Jupyter
The Language: Python
The Notional Architecture
Accessing the Jupyter Server
Asynchronous Operation
Thredds Value Added
Specialized Capabilities
Access Controls
Resource Control Mechanisms
Planned Activities
References

Introduction

For a number of years, the Unidata Thredds group has been in the process of "implementing" server-side computation Real-Soon-Now (as the saying goes).

Server-side computing embodies the idea that it is most efficient to physically co-locate a computation with the datasets on which it operates. As a rule, this has meant having a server execute the computation because the dataset is controlled by that server. Server-side computing systems for the atmospheric community have existed in various forms for a while now: GrADS, DAP2 servers, and ADDE, for example.

One -- and perhaps The -- major stumbling block to server-side computing is defining and implementing the programming language in which the computation is coded. In practice, server-side systems have developed their own language for this purpose. This is a problem primarily because it is very difficult to define and implement a programming language. Often the "language" started out as some form of constraint expression (e.g. DAP2, DAP4, and ADDE). Over time, it would accrete other capabilities: conditionals, loops, etc. In time, it grew into a badly designed but more complete programming language. Since it was rarely implemented by language/compiler experts, it usually was quirky and presented a significant learning curve for users.

The advantage of using such a home-grown language was that it could be tailored to the dataset models supported by the server. It also allowed for detailed control of programs. This made certain other issues easier: access controls and resource controls, for example.

The author recognized the language problem early on and was reluctant to go down that path. Because the author is the primary "pusher" for server-side computing at Unidata, this reluctance has delayed implementation for an extended period.

The Alternative: Jupyter

Fortunately, about three years ago, Project Jupyter [1] was created as an offshoot of the IPython Notebook system. It provided a multi-user, multi-language compute engine in which small programs could be executed. With the advent of Jupyter, IPython then refactored its computation component to use Jupyter.

From the point of view of Unidata, Jupyter provides a powerful alternative to traditional server-side computing. It supports multiple, "real" programming languages. It is a server itself, so it can be co-located with an existing Thredds server. And, most importantly, it is designed to execute small programs written in any of its supported languages.

In the rest of this document, the term "program" will, as a rule, refer to programs executing within a Jupyter server.

The Language: Python

In order to avoid the roll-your-own language problem, it was decided to adopt wholesale an existing modern programming language. This meant that the language was likely to be complete right from the start. Further, the learning curve would be reduced because a significant amount of supporting documentation and tutorials would be available.

We have chosen Python as our preferred language. We made this choice for several reasons.

Python is rapidly being adopted by the atmospheric sciences community as its language of choice.
There is a very active community that is developing packages for use by the scientific community and more specifically for the atmospheric sciences community. Examples are numerous, including numpy, scipy, metpy, and siphon.
It is one of the languages supported by Jupyter.

To the extent that Jupyter supports other languages, it would be possible to write programs in those languages. However, I would not expect Unidata to expend any significant resources on those other languages. The one possible exception is if/when Jupyter supports Java.

The Notional Architecture

Figure 1. Notional Architecture

The notional architecture we now espouse is shown in Figure 1. Basically, a standard Thredds server runs alongside a Jupyter server. A program executing in the Jupyter server has access to the data on the Thredds server either through the file system or through some streaming protocol (e.g. DAP2). File access is predicated on the assumption that the two servers are co-located and share a common file system.
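The choice between the two access paths can be sketched roughly as follows. This is only an illustration: the function name, the shared root directory, and the server URL are assumptions made for the example, not part of any Thredds or Jupyter API. The `dodsC` path segment is the prefix a Thredds server conventionally uses for its DAP2 service.

```python
from pathlib import Path


def locate_dataset(relative_path, shared_root, dap_base_url):
    """Return ("file", path) when the dataset is visible on the shared
    file system, otherwise ("dap", url) for streaming access."""
    candidate = Path(shared_root) / relative_path
    if candidate.is_file():
        # Co-located case: read the file directly.
        return "file", str(candidate)
    # Otherwise fall back to a streaming protocol such as DAP2.
    url = dap_base_url.rstrip("/") + "/" + relative_path.lstrip("/")
    return "dap", url
```

A program running in Jupyter might call `locate_dataset("gfs/run.nc", "/data/thredds", "http://localhost:8080/thredds/dodsC")` (all three values hypothetical) and open whichever location comes back.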

The Thredds server currently requires and uses some form of servlet engine (e.g. Tomcat). We exploit that to provide a front-end servlet that acts as an intermediary between a user and the Jupyter server (see below).

So now, instead of sending a program to the Thredds server, it is sent to the Jupyter server for execution. That executing program is given access to the Thredds server using a variety of packages (e.g. Siphon [2]). Once its computation is completed, its resulting products can be published within a catalog on Thredds to make them accessible to user programs. Once in the catalog, a product can be accessed by external clients using existing streaming protocol services. In some cases, it may also be possible to access that product using a shared file system.

This discussion assumes the existence of a single Jupyter server, but it will often be desirable to allow multiple such servers. Examples of the utility of multiple servers will be discussed in subsequent sections.

Accessing the Jupyter Server

Access to the Jupyter server will be supported using several mechanisms. Each mechanism has a specific use case.

IPython Access

Though not shown in Figure 1, it is assumed that existing IPython access to Jupyter is available. This path is, of course, well documented elsewhere in the IPython+Jupyter literature.

Web-based Access

Another use-case is to provide access for scientists with limited programming skills or for other users requiring simple and occasional computations.

The servlet box in Figure 1 illustrates this. In this case, client web browsers would carry out forms-based computations via the front-end servlet running under Apache Tomcat (or some other servlet engine).

Programmatic Access

Scientists will still write standalone programs that need to process computed data. Others will write value-added wrapper programs to provide, for example, additional capabilities such as plotting or other graphical presentation.

These use cases will require the ability to upload and execute programs from client-side programs. The simplest approach here is to build on the web-based version. That is, the client-side program would also access the servlet, but using a modified and streamlined interface.

Asynchronous Operation

Some computations will take a significant amount of time to complete. Submitting such a computation through the Thredds server interface is undesirable because it requires either blocking the client for long periods of time or complicating the Thredds server to make it support asynchronous execution. The latter usually involves returning some kind of token (aka future) that the client can interrogate to see if the computation is complete or, alternatively, providing some form of server-to-client event notification mechanism. In any case, such mechanisms are complicated to implement.

Direct client-to-Jupyter communication (see previous section) can provide a simple and effective alternative to direct implementation of asynchronous operation. Specifically, the client uploads the program via IPython or via a web browser to the Jupyter server. As part of its operation, the program uploads its final product(s) to some catalog in the Thredds server. The client is then responsible for detecting that the product has been uploaded, which then enables further processing of that product as needed.
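One simple way for a client to detect that a product has appeared is to poll. The sketch below is an assumption about how a client might do this, not a description of an existing Thredds facility; `wait_for_product` and its parameters are hypothetical names. The actual check is passed in as a callable, so it could re-read a catalog, test for a file on a shared file system, or anything else.

```python
import time


def wait_for_product(is_published, timeout_s=600.0, poll_interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until is_published() returns True or timeout_s elapses.

    is_published is any zero-argument callable, for example one that
    re-reads a catalog and looks for the expected dataset name.
    Returns True if the product appeared, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_published():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Polling keeps both servers unchanged, at the cost of some latency between publication and detection; the poll interval trades that latency against load on the catalog.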

Thredds Value Added

Given the approach advocated in this document, on what should Unidata focus to support it?

Accessing Thredds Data

First and foremost, we want to make it easy, efficient, and fast for programs to access the data within a co-located Thredds server.

Thredds currently provides a significant number of "services" [3] through which metadata and data can be extracted from a Thredds server. These include at least the following: DAP2 (OpenDAP), DAP4, HTTPServer, WCS, WMS, NetcdfSubset, CdmRemote, CdmrFeature, ISO, NCML, and UDDC.

The cost of accessing data via some commonly supported protocols, such as DAP2 or CdmRemote, is relatively independent of co-location, so using such protocols is probably not the most efficient method.

File Download

The most efficient inter-server communication is via a shared file system accessible both to the Thredds server and the Jupyter server.

As of Thredds 5 it is possible to materialize both datasets and (some kinds of) streams as files: typically netcdf-3 (classic) or netcdf-4 (enhanced). One defines a directory into which downloads are stored. A special kind of request is made to a Thredds server that causes the result of the query to be materialized in the specified directory. The name of the materialized file is then returned to the client.

Siphon

The Siphon project [2,4] is designed to wrap access to a Thredds server using a variety of Thredds services. As such, it will feature prominently in our system. Currently, Siphon supports the reading of catalogs, and data access using the Thredds netCDF subset service (NCSS), CdmRemote, and Radar Data.
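As a rough illustration of what catalog reading plus NCSS access looks like from a program, here is a minimal sketch. It assumes Siphon is installed and that the catalog's first dataset offers the NetCDF Subset Service; the helper names `catalog_url` and `fetch_box`, and any server address passed to them, are hypothetical. The type of the returned object depends on the requested format and on which optional packages are installed.

```python
def catalog_url(server, catalog_path):
    """Join a server base URL and a catalog path into one catalog URL."""
    return server.rstrip("/") + "/" + catalog_path.lstrip("/")


def fetch_box(url, variable, when, north, south, east, west):
    """Read a catalog, then request one variable over a lon/lat box."""
    # Imported lazily so catalog_url() works without Siphon installed.
    from siphon.catalog import TDSCatalog

    catalog = TDSCatalog(url)
    dataset = catalog.datasets[0]      # first dataset listed in the catalog
    ncss = dataset.subset()            # client for the NetCDF Subset Service
    query = ncss.query()
    query.lonlat_box(north=north, south=south, east=east, west=west)
    query.time(when)                   # a datetime.datetime instance
    query.variables(variable)
    query.accept("netcdf4")
    return ncss.get_data(query)
```

Note that this path still goes through a network service, so it does not by itself exploit co-location.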

Operators

The raison d'etre of server-side computation is to input datasets, apply operators to them, and produce new product datasets. In order to simplify this process, it is desirable to make available many high-level operators so that a computation can be completed by the composition of operators.
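To make "composition of operators" concrete, here is a toy sketch using plain Python lists in place of real datasets. The operators and the `compose` helper are invented for this example; a real operator library would work on netCDF variables or arrays.

```python
from functools import reduce
from statistics import fmean


def compose(*operators):
    """Chain operators left to right into a single operator."""
    def pipeline(values):
        return reduce(lambda acc, op: op(acc), operators, values)
    return pipeline


def anomaly(values):
    """Subtract the series mean from every value."""
    mean = fmean(values)
    return [v - mean for v in values]


def running_mean(window):
    """Build an operator that averages each run of `window` values."""
    def op(values):
        return [fmean(values[i:i + window])
                for i in range(len(values) - window + 1)]
    return op


# A new product defined purely by composing existing operators.
smoothed_anomaly = compose(anomaly, running_mean(2))
```

Because every operator takes a dataset and returns a dataset, a client only needs to name the operators and their order to describe a computation.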

Often, server-side computation is illustrated using simple operations such as sum and average. But these kinds of operators are likely to have only marginal utility; they may be useful, but they will not be the operators doing the heavy lifting of server-side computation.

Accumulating useful operators is possibly another place where Unidata can provide added value. Unidata can provide both a common point of access and some form of vetting for these operators.

One example is Pynco [5]. This is a Python wrapping of the netCDF Operators (NCO) [6]. NCO is currently all command line, so Pynco wraps them to allow programmatic invocation of the various operators.
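The general idea of wrapping a command-line operator can be sketched in a few lines. This is not Pynco's API; the function names are made up for illustration. The sketch targets `ncra`, NCO's record averager, and uses its `-O` (overwrite output) and `-v` (select variables) options. Actually running it requires NCO to be installed.

```python
import subprocess


def ncra_command(inputs, output, variables=None):
    """Build the argument list for NCO's record averager, ncra."""
    args = ["ncra", "-O"]                    # -O: overwrite existing output
    if variables:
        args += ["-v", ",".join(variables)]  # -v: restrict to these variables
    return args + list(inputs) + [output]


def run_ncra(inputs, output, variables=None):
    """Invoke ncra; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(ncra_command(inputs, output, variables), check=True)
    return output
```

Passing the arguments as a list rather than a shell string avoids shell interpretation of file names, which matters when the names come from a client.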

As part of the operator support, Unidata might wish to create a repository (using conda channels or GitHub?) to which others can contribute.

Publication (File Upload)

When a program is executed within Jupyter, it will produce results that need to be communicated to others -- especially the client originating the computation. The obvious way to do this is to use the existing Thredds publication facilities, namely catalogs.

As of Thredds 5, it is possible to add a directory to some top-level catalog. Uploading a file into that directory causes it to appear in the specified catalog. Uploading can be accomplished either by file system operations or via a browser forms page.
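For the file-system route, publication amounts to copying the product into the catalog directory. A minimal sketch follows, assuming the Jupyter program can write to that directory; the `publish` function and any directory names are hypothetical. It copies to a temporary name and then renames, so the product never appears under its final name half-written.

```python
import os
import shutil
import tempfile


def publish(product_path, catalog_dir):
    """Copy a product into the catalog directory under its own file name.

    The copy goes to a temporary name first and is then renamed, so the
    final name only ever refers to a completely written file.
    """
    os.makedirs(catalog_dir, exist_ok=True)
    final_path = os.path.join(catalog_dir, os.path.basename(product_path))
    fd, tmp_path = tempfile.mkstemp(dir=catalog_dir, suffix=".part")
    os.close(fd)
    try:
        # shutil.copy also copies permission bits from the source file.
        shutil.copy(product_path, tmp_path)
        os.replace(tmp_path, final_path)   # atomic within one file system
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

Whether the server would briefly list the temporary `.part` file depends on how it scans the directory, which this sketch does not address.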

Specialized Capabilities

Another way to add value is to make libraries available that support specialized kinds of computations.

GPU Support

The power of Graphics Processing Units (GPUs) has significantly increased over the last few years. Libraries now exist for performing computations on GPUs. To date, using a GPU on atmospheric data is uncommon. It should be possible to improve the situation by making operators available that use a GPU underneath to carry out the computation.

Machine Learning Support

Artificial Intelligence, at least in the form of machine learning, is another example of a specialized capability. Again, use of AI to process atmospheric data is currently not common. It should be possible to build quite sophisticated subsystems supporting the construction of AI systems for doing predictions and analyses on such data.

Access Controls

There is a clear danger in providing a Jupyter server open to anyone to use. Such a server is a potentially exploitable security hole if it allows the execution of arbitrary code. Further, there are resource issues when anyone is allowed to execute a program on the server.

Much of the support for access controls will depend on the evolving capabilities implemented by the Jupyter project. But we can identify a number of access controls that will be needed to protect a Jupyter server.

Sandboxing

The most difficult problem is to prevent the execution of arbitrary code on the Jupyter server. Effectively, such code must be sandboxed to control what facilities are made available to executing programs. Two sub-issues arise.

Some Python packages must be suppressed. Arbitrary file operations and sub-process execution are two primary points of concern.
Various packages must be usable: numpy, metpy, and siphon, for example. They are essential for producing the desired computational products. However, the security of the system depends on the security of those packages. If they contain exploitable security flaws, then security as a whole is compromised.
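One small piece of such a sandbox could be a static check of which modules a submitted program imports. The sketch below is only an illustration of that idea: the allowlist contents and the function name are assumptions, and the check sees only literal import statements. Dynamic imports (for example via `__import__` or `importlib`) bypass it, so it could at most complement real isolation at the operating-system or container level.

```python
import ast

# Illustrative allowlist; a real deployment would choose its own.
ALLOWED_MODULES = {"math", "numpy", "scipy", "metpy", "siphon"}


def disallowed_imports(source, allowed=ALLOWED_MODULES):
    """Return the sorted top-level module names that a program imports
    and that are not on the allowlist."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.level:
                # Relative import: report it as written, e.g. ".helpers".
                found.add("." * node.level + (node.module or ""))
            else:
                found.add(node.module.split(".")[0])
    return sorted(found - set(allowed))
```

A front end could reject any program for which this function returns a non-empty list before the program ever reaches the Jupyter server.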

Authentication

Strong authentication mechanisms will need to be implemented so that only authorized users can utilize the resources of a Jupyter server. Jupyter authentication may need to be coordinated with the Thredds server so that some programs executed on Jupyter can have access to otherwise protected datasets on the Thredds server. This is one case where multiple Jupyter servers (and even multiple Thredds servers) may be needed to support specialized access to controlled datasets by using isolated Jupyter servers.

Resource Control Mechanisms

Uncontrolled execution of code can potentially be a significant performance problem. Additionally, it can result in significant costs (in dollars) being charged to the server's owner.
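The simplest resource control is a wall-clock limit. The following sketch runs a program in a child Python interpreter and kills it if it runs too long. It is an assumption about one possible mechanism, with a hypothetical function name, and it limits time only; memory and CPU limits would need something more, such as operating-system or container limits.

```python
import subprocess
import sys


def run_with_time_limit(source, limit_s):
    """Run Python source in a child interpreter with a wall-clock limit.

    Returns (completed, stdout); completed is False if the child was
    killed for exceeding the limit.
    """
    try:
        result = subprocess.run([sys.executable, "-c", source],
                                capture_output=True, text=True,
                                timeout=limit_s)
    except subprocess.TimeoutExpired:
        return False, ""
    return True, result.stdout
```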

In many situations, it will be desirable to require clients to stand up their own Jupyter server co-located with some Thredds server, in a way that makes the client pay for the cost of the Jupyter server. Cloud computing is the obvious approach. Clients will pay for their own virtual machine that is as "close" to the Thredds server as their cloud system will allow. The client can then use their own Jupyter server on their own virtual machine to do the necessary computations, for which they will be charged.

Planned Activities

A preliminary demonstration of communication between Thredds and Jupyter was created by Ryan May under the auspices of the ODSIP grant [7] funded by the NSF EarthCube program.

We anticipate starting from the ODSIP base demonstration and extending it over time. Subject to revision, the current plan involves the following steps.

Step 1. Servlet-Based Access

The first step is to build the servlet front-end. This may be viewed as a stripped-down mimic of IPython. This servlet will support both forms-based and programmatic access.

Step 2. Operators

An initial set of operator libraries will need to be collected so that testing, experimentation, and tutoring can proceed. This will be an ongoing process. One can hope that some form of repository can be established and that a critical mass of operators will begin to form.

Step 3. Configuration

The next step is to make it possible, though not necessarily easy, for others to stand up their own Jupyter + Thredds. One approach would be to create a set of Docker instructions for this purpose. This would allow others to directly instantiate the Docker container as well as provide a recipe for non-Docker operation.

Step 4. Examples

As with any new system, external users will have difficulties in using it. So a variety of meaningful examples will need to be created to allow at least cutting-edge users to begin to experiment with the system. Again, this will be an ongoing activity.

Step 5. Access Controls

An initial access control regime, at least, cannot be delayed for very long. Some external users can live without this in the short term. But for more widespread use, users must have some confidence in the security of the systems that they create. As with operators, this will be an ongoing process.

Step 6. Workshops and Tutorials

At some point, this approach must be presented to the larger community. For Unidata, this is usually done using our Workshops. Additionally, video tutorials and presentations will need to be created.