Unidata Developer's Bloghttps://www.unidata.ucar.edu/blogs/developer/en/feed/entries/atom2024-03-05T10:00:34-07:00Apache Rollerhttps://www.unidata.ucar.edu/blogs/developer/entry/my-summer-with-java-implementingMy summer with Java: Implementing dataset enhancements on THREDDS Data ServerUnidata News2023-08-08T15:46:22-06:002023-08-09T09:00:18-06:00
<div class="img_l" style="width: 150px;">
<a class="lightbox" title="Jessica Souza" href="/blog_content/images/2023/20230614_jsouza.png">
<img width="150" src="/blog_content/images/2023/20230614_jsouza.png" alt="Jessica Souza" />
</a>
<div class="caption">
Jessica Souza
</div>
</div>
<p class="byline">
by
<a href="/community/internship/#2023js">Jessica Souza</a>
<br />2023 Unidata summer intern
</p>
<p>
During my internship, I worked with the Unidata THREDDS team. My intentions this
summer were to learn Java, improve my coding skills, and have experience using it in
real world applications. I began my journey by converting existing unit tests for the
netCDF-Java library, which is tightly linked to the
<a href="https://www.unidata.ucar.edu/software/tds/">THREDDS Data Server</a> (TDS) code,
to the JUnit Java testing framework. Once I got this practice with Java and had a
working development environment, I was able to start working on my summer project.
</p>
<p>
With the extensive increase in the use of machine learning models in Earth science
related research, my project was an initiative in the direction of providing new
datasets intended for machine learning use. Since Earth sciences has become
substantially data-driven, with a variety of forecast models, large model
simulations, and satellite missions, there is an unprecedented rise in raw
unprocessed data. When working with machine learning models, significant
preprocessing of the data is required; this involves cleaning, re-scaling, and
splitting the dataset. The goal of re-scaling is to transform features onto a
similar range, improving the performance and training stability of the model.
Re-scaling is not always necessary, but it is essential when dealing with
multiple variables on different scales. My project focused on performing dataset
preprocessing, in this case re-scaling, before the data is accessed by users
targeting machine learning applications.
After reviewing 13 papers from the AMS journal <em>Artificial Intelligence for the
Earth Systems</em> (AIES), my Unidata mentor and I selected <em>standardization</em>
and <em>normalization</em> (common types of re-scaling) for implementation as part of
my project.
</p>
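For reference, the two re-scalings have simple closed forms: standardization maps each value x to (x − mean)/stddev, and min-max normalization maps x to (x − min)/(max − min). A minimal sketch in Java (the class and method names here are illustrative, not the actual netCDF-Java classes):

```java
// Illustrative sketch of the two re-scalings; not the netCDF-Java API.
public class Rescale {
    // Standardization (z-score): (x - mean) / stddev, using the
    // population standard deviation, as scikit-learn's StandardScaler does.
    public static double[] standardize(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / x.length);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / std;
        return out;
    }

    // Min-max normalization: (x - min) / (max - min), mapping onto [0, 1],
    // as scikit-learn's MinMaxScaler does by default.
    public static double[] normalize(double[] x) {
        double min = x[0], max = x[0];
        for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - min) / (max - min);
        return out;
    }
}
```

In the actual implementation, the statistics were computed with the Apache Commons Mathematics Library, which provides streaming variants suitable for large datasets.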
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="Visualization options for preprocessed data served by the TDS - Jupyter notebook and Godiva3 examples." href="/blog_content/images/2023/20230807_jessica_VisualizationNotebooks_Thredds.png">
<img width="200" src="/blog_content/images/2023/20230807_jessica_VisualizationNotebooks_Thredds.png" alt="Data visualizations" />
</a>
<div class="caption">
Preprocessed data on TDS (click to enlarge)
</div>
</div>
<p>
I decided to implement two functions in Java based on the <code>StandardScaler</code>
and <code>MinMaxScaler</code> functions from <a href="https://scikit-learn.org/">Scikit-learn</a>, a Python machine
learning library. Using an external Java library suitable for large data streams
(<a href="https://commons.apache.org/proper/commons-math/">Apache Commons Mathematics Library</a>), I created the classes <code>Standardizer</code>
and <code>Normalizer</code>. Next, they were integrated into the netCDF-Java codebase.
This process included creating constants/attributes in the <a href="https://docs.unidata.ucar.edu/netcdf-java/current/userguide/common_data_model_overview.html">Common Data Model</a> class
for standardization and normalization, adding <code>Standardizer</code> and
<code>Normalizer</code> to the set of possible data enhancements, and applying the
enhancements when "standardizer" or "normalizer" appeared as a netCDF variable attribute
and the data was of floating-point type. Using the new classes in the TDS through the <a href="https://docs.unidata.ucar.edu/netcdf-java/current/userguide/basic_ncml_tutorial.html">
NetCDF Markup Language</a> (NcML) allowed the creation of a virtual dataset that
could be returned to
the user without altering the original data or requiring additional disk usage. By
making these processed datasets available to TDS users, we reduce the amount of data
preprocessing required on the user end.
</p>
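As a sketch of how this wiring might look in NcML (the attribute name and placement below are assumptions based on the description above, not documented NcML syntax):

```xml
<!-- Hypothetical NcML: wrap an existing file and mark a float variable
     for standardization; the TDS would serve the re-scaled values as a
     virtual dataset, leaving the original file untouched. -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="gfs_example.nc">
  <variable name="Temperature_surface">
    <attribute name="standardizer" value="true" />
  </variable>
</netcdf>
```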
<p>
The initial datasets chosen for preprocessing on the THREDDS test server were forecast
(GFS) and satellite (GOES 18) data, due to their frequent use in the <em>AIES</em>
papers reviewed. In addition to adding a mechanism to access the preprocessed datasets
on the TDS test server, we included Jupyter notebooks for visualization of the
preprocessed variables. I also created automated tests to verify that the code
behaved as expected, which involved both unit testing and integration testing.
</p>
<p>
During the project, I also gained experience with GitHub in creating issues, pull
requests and code review. Furthermore, tests on the performance difference with the
use of the re-scaling were also evaluated. As next steps, the already reasonable
results of the performance tests could be improved and more datasets relevant to the
users could be provided.
</p>
<p>
This summer offered me invaluable personal and professional development
opportunities including the Unidata Users Workshop, Project Pythia Hackathon, and
the professional development workshop series for the UCAR interns. The combination
of all the experiences throughout the internship helped me build
confidence as an open source contributor. Working on my project, with the
dedicated support of my mentor and the THREDDS team, has deepened my passion for
scientific software development.
</p>
https://www.unidata.ucar.edu/blogs/developer/entry/hacktoberfest-challenge-share-your-jupyterHacktoberfest challenge - share your Jupyter NotebooksHailey Johnson2020-10-07T15:17:24-06:002020-10-07T15:17:24-06:00<p>Unidata is looking for community contributions to the <a href="https://docs.unidata.ucar.edu/tds/5.0/userguide/customizing_tds_look_and_feel.html#jupyter-notebooks">Jupyter Notebook service</a>, which provides Jupyter Notebooks to facilitate accessing, exploring, and visualizing datasets in the TDS. Users who have written Notebooks that interface with TDS datasets (or Python scripts that can be converted) or are interested in creating one can contribute in one of two ways (or both!):</p>
<ol>
<li>Contribute a generic viewer that would be useful in a typical THREDDS Data Server (see the GitHub issue <a href="https://github.com/Unidata/tds/issues/117">here</a>).</li>
<li>Contribute a viewer for a specific dataset, data type, or catalog included in the Unidata THREDDS Data Server (see GitHub issue <a href="https://github.com/Unidata/TdsConfig/issues/93">here</a>).</li>
</ol>
<p>The issues above are eligible for <a href="https://hacktoberfest.digitalocean.com/">Hacktoberfest</a>: you get Hacktoberfest credit and help the Unidata community! For more information on contributing and some helpful links, visit the GitHub issues above.</p>
<p>Happy hacking!</p>
https://www.unidata.ucar.edu/blogs/developer/entry/my-summer-of-improving-theMy Summer of Improving the TDS Web InterfaceUnidata News2018-08-10T14:35:07-06:002018-08-10T14:35:07-06:00
<p><link rel="stylesheet" type="text/css" href="/css/jquery/jquery.lightbox-0.5.css" media="screen" /></p>
<script type="text/javascript" src="/js/jquery/jquery.lightbox-0.5.min.js"></script>
<script type="text/javascript">
$(document).ready(function() {
$('a.lightbox').lightBox();
});
</script>
<!-- End Lightbox stuff -->
<div class="img_l" style="width: 150px;">
<a class="lightbox" title="Hailey Johnson" href="/blog_content/images/2018/20180608_hajohns_1_400.jpg">
<img width="150" src="/blog_content/images/2018/20180608_hajohns_1_400.jpg" alt="Hailey Johnson"
/> </a>
<div class="caption">
Hailey Johnson
</div>
</div>
<p class="byline">
by
<a href="https://www.unidata.ucar.edu/blogs/news/entry/welcome-summer-intern-hailey-johnson">Hailey Johnson</a>
<br />2018 Unidata summer intern
</p>
<p>During my time here at Unidata, I’ve focused on extending the THREDDS Data Server (TDS)
web interface and services. I spent the first few weeks of the summer redesigning the
interface to be more intuitive to end users and implementing UI changes using Thymeleaf
HTML templating. The new TDS catalog pages are designed with a “plug-and-play” structure,
allowing users to override or insert their own contributed HTML, which is processed
by a server-side Thymeleaf template resolver.
</p>
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="THREDDS Catalog data pages now include a link to a pre-populated Jupyter notebook demonstrating data access in Python."
href="/blog_content/images/2018/20180801_devblog_intern_hailey_01.png">
<img width="200" src="/blog_content/images/2018/20180801_devblog_intern_hailey_01.png" alt="TDS Catalog"
/> </a>
<div class="caption"> (click to enlarge) </div>
</div>
<div class="img_r" style="width: 200px;">
<a class="lightbox" title="Notebooks demonstrate how to use Siphon to retrieve a dataset."
href="/blog_content/images/2018/20180801_devblog_intern_hailey_02.png">
<img width="200" src="/blog_content/images/2018/20180801_devblog_intern_hailey_02.png" alt="Siphon notebook" />
</a>
<div class="caption"> (click to enlarge) </div>
</div>
<p>Halfway through the summer, I thought that it would be useful if every dataset page in
the TDS generated a code snippet demonstrating how to access that dataset and its metadata
using Siphon. This idea rapidly escalated into a TDS Jupyter Notebook service, which
returns a Jupyter Notebook (ipynb) pre-populated with a dataset ID and catalog URL.
The default Notebook describes and demonstrates use of the
<a href="https://www.unidata.ucar.edu/software/siphon/">Siphon</a> remote access protocol and uses notebook widgets to allow users to explore
variables within the dataset. Users can supply other Notebook files in place of (or
in addition to) the default Notebook. Contributed Notebooks can be registered as valid
for all datasets or a subset using the ipynb metadata; the mapping between Notebooks
and datasets can be specified by dataset IDs, parent catalogs, or data type (e.g. grid,
point). (Administrators of TDS version 5.0 and later will find information on tailoring
the notebooks in the section of the documentation titled "Extending TDS Services.")
</p>
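For illustration, a contributed Notebook's registration might look like the following ipynb metadata sketch (the field names are assumptions for this example; the actual schema is described in the "Extending TDS Services" documentation):

```json
{
  "metadata": {
    "viewer_info": {
      "accept_datasetIDs": ["casestudies/example_dataset"],
      "accept_catalogs": ["grib/NCEP/GFS"],
      "accept_dataset_types": ["grid"]
    }
  }
}
```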
<p>You can find my contributions to the TDS live on Unidata's
<a href="https://thredds-dev.unidata.ucar.edu/thredds/catalog/catalog.html">thredds-dev data server</a>!
</p>
<p>The goal of these new features is to help lower the barrier to entry and enable a broader
community of users to access data using the TDS. I’ve enjoyed the work I’ve done at
Unidata this summer and would like to continue working on problems of data management
and access in the geosciences in my future research and career.
</p>
https://www.unidata.ucar.edu/blogs/developer/entry/thredds-licence-changeTHREDDS License ChangeSean Arms2018-02-12T12:22:47-07:002018-02-12T13:33:27-07:00
<div class="img_l" style="width: 100px;">
<img width="100" src="/images/logos/thredds_netcdf-150x150.png" alt="TDS" />
</div>
<p>As we approach the first public beta of version 5.0 of the THREDDS Data Server (TDS),
we have decided to revisit our software license. Currently, both NetCDF-Java and the
TDS are released under the same license that the netCDF C library uses, which is a
license that was "home grown" at UCAR. It's usually called an "MIT-style license,"
though it is perhaps more similar to the BSD-3 Clause license. Rather than continue
to use the "home grown" license, we will be moving to a standard, off-the-shelf BSD-3
license, bringing the TDS and NetCDF-Java packages more in line with standard practice
within the Open Source community.
</p>
<h3>What This Means For You</h3>
<p>If you are a user of the TDS, this change in licensing does not affect you.</p>
<p>If you are a developer who includes technology from the TDS or netCDF-Java in your own work, you will
find that the new license is slightly less restrictive than the current license, in
that it does not explicitly request that you credit UCAR/Unidata in publications that
result from your use of the technology. Of course, we still appreciate any acknowledgement
you do provide; such credits are very useful to Unidata in making the case to our funders
that the work we do is valuable to our community. If you find Unidata’s contributions
to community cyberinfrastructure useful, please consider
<a href="https://www.unidata.ucar.edu/community/index.html#acknowledge">citing the technology</a> you’re
using, or perhaps including a
<a href="https://www.unidata.ucar.edu/images/logos/badges/badges.html#home">badge</a>
in your online materials.
</p>
<h3>What is Changing</h3>
<p>Here is the text of the current license: </p>
<pre>
Copyright 1998-2015 University Corporation for Atmospheric Research/Unidata
Portions of this software were developed by the Unidata Program at the
University Corporation for Atmospheric Research.
Access and use of this software shall impose the following obligations
and understandings on the user. The user is granted the right, without
any fee or cost, to use, copy, modify, alter, enhance and distribute
this software, and any derivative works thereof, and its supporting
documentation for any purpose whatsoever, provided that this entire
notice appears in all copies of the software, derivative works and
supporting documentation. Further, UCAR requests that the user credit
UCAR/Unidata in any publications that result from the use of this
software or in any product that includes this software. The names UCAR
and/or Unidata, however, may not be used in any advertising or publicity
to endorse or promote any products or commercial entity unless specific
written permission is obtained from UCAR/Unidata. The user also
understands that UCAR/Unidata is not obligated to provide the user with
any support, consulting, training or assistance of any kind with regard
to the use, operation and performance of this software nor to provide
the user with any updates, revisions, new versions or "bug fixes."
THIS SOFTWARE IS PROVIDED BY UCAR/UNIDATA "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL UCAR/UNIDATA BE LIABLE FOR ANY SPECIAL,
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING
FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
WITH THE ACCESS, USE OR PERFORMANCE OF THIS SOFTWARE.
</pre>
<p>Here is the new license:</p>
<pre>
Copyright 1998-2018 University Corporation for Atmospheric Research/Unidata
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER
OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
</pre>
https://www.unidata.ucar.edu/blogs/developer/entry/event-notification-for-thredds-serversEvent Notification for Thredds ServersDennis Heimbigner 2017-08-07T13:35:42-06:002017-08-07T13:35:42-06:00
<p><strong>Initial Draft</strong>: 2017-08-05 <br />
<strong>Last Revised</strong>: 2017-08-05 <br />
<strong>Author</strong>: Dennis Heimbigner, Unidata </p>
<h3>Table of Contents</h3>
<ul>
<li><a href="#Introduction">Introduction</a></li>
<li><a href="#Architect">Architecture</a></li>
<li><a href="#Requirements">Requirements</a></li>
<li><a href="#InitialEvents">Initial Event Set</a></li>
<li><a href="#TopicSpaceDesign">Topic Space Design</a></li>
<li><a href="#Implementation">Implementation</a></li>
<li><a href="#LDM">Relation to LDM</a></li>
<li><a href="#Performance">Performance Costs</a></li>
<li><a href="#Security">Security</a></li>
<li><a href="#Summary">Summary</a></li>
<li><a href="#AppendixA">Appendix A. Miscellaneous Notes</a></li>
</ul>
<h2><a name="Introduction">Introduction</a></h2>
<p>Periodically some of the Thredds servers run by Unidata get
seriously overloaded. One cause is external users polling
the Thredds server to see what has changed. If the polling
rate is too high, the performance of the Thredds server can
seriously deteriorate.</p>
<p>I am proposing here to mitigate this problem by allowing
Thredds servers to generate events that signal changes that
might be of interest to users. Then, instead of polling,
these users can watch for specific change events and use
that information to update their local databases (or whatever).</p>
<p>The cost tradeoff for Unidata is the cost of periodic
"hammering" versus the maintenance of an event server
to distribute change events to users.</p>
<p>Looking ahead, it is also possible that this proposal can
facilitate inter-server communications. This means that multiple
Thredds servers could communicate useful information. This is
speculative for now, but should be kept in mind.</p>
<h2><a name="Architect">Architecture</a></h2>
<p>I am proposing a pretty standard publish-subscribe
system for use by a Thredds server. In this architecture,
there are hooks in various places in the Thredds code
that send short messages to a separate "broker" server.</p>
<p>On the client (user) side, each client registers with the broker
to tell it the kinds of messages in which it is interested.</p>
<p>So the flow is:</p>
<ol>
<li>The server generates a change message</li>
<li>The message is received by the broker</li>
<li>The broker forwards the message to all
clients that are registered as interested
in that kind of message.</li>
</ol>
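The three-step flow above can be sketched with a toy in-memory broker (illustrative only; a real deployment would use a standalone broker such as ActiveMQ, with clients connecting over the network):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy broker: clients register a callback per topic; the server publishes
// a message to a topic and the broker forwards it to all matching clients.
public class ToyBroker {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // Step 0: a client tells the broker which topic it is interested in.
    public void subscribe(String topic, Consumer<String> client) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(client);
    }

    // Steps 1-3: the server generates a message; the broker receives it
    // and forwards it to every client registered on that topic.
    public void publish(String topic, String message) {
        for (Consumer<String> c : subscribers.getOrDefault(topic, List.of()))
            c.accept(message);
    }
}
```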
<h2><a name="Requirements">Requirements</a></h2>
<p>In order to be useful to Unidata and its community,
I require certain capabilities for the event system.</p>
<h3>Topic-based Messages</h3>
<p>In event systems, there are typically two ways to identify
messages: by queue and by topic.</p>
<p>A topic based message is one that has an associated structured string
used to classify the message. Often, the structure of the string
is a tree represented by the format <em>field.field.field...</em>
where each field is some identifier. This format can be used, for example,
to mark the message as referring to some file in a tree structured file system.
Thus a file <em>/f1/f2/f3</em> might be mapped to the topic <em>f1.f2.f3</em>.</p>
<p>A queue-based identification is one in which a message is sent to
a specific named queue. It is isomorphic to a topic system as far
as a sender is concerned because each distinct topic string can
be the name of a queue. I will not consider queue-based systems further.</p>
<h3>Topic Wildcards</h3>
<p>On the client side, the client must be able to register for
messages by specifying a pattern indicating the message topics
in which it is interested. It is desirable to allow a client
to register for a number of different topics by specifying a pattern
containing wildcards (as is common in e.g. Unix file specifications).</p>
<p>If, for example, our client was interested in events about all files
within <em>/f1/f2</em>, it should be possible to specify a topic pattern
such as "f1.f2.*".</p>
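A sketch of the path-to-topic mapping and a single-field wildcard match (helper names are mine; a production system would follow an established convention such as the JMS or MQTT wildcard rules):

```java
import java.util.regex.Pattern;

public class Topics {
    // Map a file path like /f1/f2/f3 to a topic like f1.f2.f3.
    public static String pathToTopic(String path) {
        return path.replaceAll("^/+", "").replace('/', '.');
    }

    // Match a topic against a pattern where '*' matches exactly one
    // dot-free field; everything else is matched literally.
    public static boolean matches(String pattern, String topic) {
        String regex = Pattern.quote(pattern)
                .replace("*", "\\E[^.]+\\Q"); // '*' => one field
        return topic.matches(regex);
    }
}
```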
<h3>Durability</h3>
<p>Suppose a client is not active or not registered with the broker
at the time an event is received by the broker.
If later the client registers, it will not see that
previously generated message. This is a problem because
a client will be forced to again access the Thredds
server to see what happened while it was offline.</p>
<p>To deal with this problem, I require that our broker support
"durable" messages. In the event community, this means that
the broker will cache messages for some period of time. When a client
registers, it will receive any cached messages that match its pattern.
Supporting durability is tricky because of issues such as message
retention policies and message duplication. Nonetheless, this is
an essential requirement in order to avoid polling as much as possible.</p>
<h3>Persistence (Optional)</h3>
<p>In this context, persistence means that cached messages are maintained even
if the broker crashes. It is closely related to durability if the
durable messages are stored in some kind of file-based database system.
I do not require persistence, although it may come for free with
some brokers.</p>
<h3>Transactions (Not Required)</h3>
<p>Transactions in event systems are similar to those of database systems.
It essentially means that a message is guaranteed to be delivered
or it appears as if it is not delivered at all. It is unlikely that
transactions will be required for our event system.</p>
<h2><a name="InitialEvents">Initial Event Set</a></h2>
<p>The initial set of events of interest will be catalog
change events. Specifically:</p>
<ul>
<li>Insert - when a catalog has a new entry</li>
<li>Delete - when a catalog has an existing entry removed</li>
<li>Modify - when a catalog has an existing entry modified</li>
</ul>
<p>Modification is tricky since it is effectively detected by looking at
the modification date of a file, which may not reflect an actual
change. Note also that it may turn out that modifications will
actually look like a deletion followed by an insertion.</p>
<p>Note that modification false positives are generally acceptable
if not too frequent. This is because the cost to the client is
that it retrieves a catalog entry that has in fact not been modified.</p>
<p>False negatives are much less acceptable. This will occur if
an insertion, deletion, or modification occurs but no corresponding
event is generated. If a client cannot trust the server in this
regard, then the event system will not be used.</p>
<p>Note also that change events should also apply to the creation of new catalogs
as well as the files within a catalog.</p>
<h2><a name="TopicSpaceDesign">Topic Space Design</a></h2>
<p>Topic Space Design is an important aspect of this proposal.
That is, we need to set up a topic tree such that
clients can specify what they want with a reasonable degree of
specificity.</p>
<p>From LDM, we know that this is important in order to avoid
clients just subscribing to everything. I would hope that
using a wildcard system -- as opposed to e.g.
regular expressions -- is sufficiently simple that clients
will not be tempted to ask for everything. Realistically,
this is probably a forlorn hope because it is likely
that pollers want to know everything about a specific server.</p>
<p>My initial thought is that the root of our topic space
is "Unidata.Thredds". From there, I would like to specify
a particular server via a DNS name + port. There is a problem
since DNS names contain dots. It may be necessary to use an
encoding, URL %xx encoding for example, to escape the dots and
colons in the DNS name and port. So one might say
"Unidata.Thredds.motherlode%2eucar%2eedu%3a8080".</p>
<p>From there, the obvious choice is to encode the catalog path
as the rest of the topic. So, for example:</p>
<pre>
Unidata.Thredds.thredds%2eucar%2eedu%3a8080.catalog.grib.NCEP.GEFS.Global%2e1p0deg%2eEnsemble.members-analysis.GEFS%2eGlobal%2e1p0deg%2eEnsemble%2eana%2e20170723%2e0000.grib2.*
</pre>
<p>The character escaping issue needs some thought.</p>
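One possible encoding, escaping just the two characters that collide with the topic-field syntax (an assumed helper, not existing Thredds code):

```java
public class TopicEncode {
    // Escape '.' and ':' so a host:port can be embedded as a single
    // topic field without being split on the topic separator.
    public static String encodeField(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '.') out.append("%2e");
            else if (c == ':') out.append("%3a");
            else out.append(c);
        }
        return out.toString();
    }
}
```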
<h2><a name="Implementation">Implementation</a></h2>
<p>Currently, Thredds does not immediately detect changes
to its underlying file system. Rather, it dynamically
rebuilds some catalogs when accessed. The dynamically
generated catalog will, of course, reflect changes to that
catalog since it was last accessed.</p>
<p>John Caron left nascent code (CatalogWatcher) in Thredds
to actually operate at the time a change occurred (as opposed
to when a catalog is retrieved). This is more or less the
eager vs lazy issue.</p>
<p>So, in order to make this proposal work, I need to do (at least)
the following:</p>
<ul>
<li>Complete the existing code pieces such as CatalogWatcher</li>
<li>Add code to convert a file system reference to a catalog system URL</li>
<li>Add an event sender to Thredds</li>
</ul>
<p>This appears to be a straightforward set of modifications.
Of course Murphy's law says that I am forgetting something.</p>
<h2><a name="LDM">Relation to LDM</a></h2>
<p>My current broker of choice is Apache ActiveMQ.
But I cannot help but notice that LDM functionality is related
to this proposal. So it is fair to ask whether LDM could be adapted
to serve as the broker. Here are some issues that would need to
be considered. Note that my knowledge of the current capabilities
of LDM may be out of date.</p>
<ol>
<li>Message Size: LDM ships files, not short messages. In this sense,
LDM is overkill.</li>
<li>Volume: my speculation is that the volume of small messages
would not be all that large; it might even be similar to
the volume of distinct files shipped by LDM.</li>
<li>Multiple Languages: whatever broker we use, it must be possible
to write clients in a variety of programming languages: C, Python,
Java at least.</li>
</ol>
<p>While it would be nice to make use of other Unidata technology,
I currently do not plan to pursue this path. As noted, my
current target is an ActiveMQ broker with JMS publishers and subscribers.</p>
<h2><a name="Performance">Performance Costs</a></h2>
<p>There is a performance cost in having the Thredds server generate
events. But it is difficult to assess these costs without knowing
how many events are being generated.</p>
<p>It should be possible to experimentally measure the number of events
by adding counters to the CatalogWatcher code and running it for
an extended period of time. Periodically, the counters can be dumped
to the logs.</p>
<p>Ideally we would have R x 3 x 24 x 31 counters so we could
compute the number of events per hour, day, week, and month per
catalog root (assuming R roots). This is not a large amount of
space. So at the end of every hour, day, and month, a log entry
would be generated with a terse listing of the counters. The
counters would then be reset.</p>
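The counting itself is simple; a sketch of per-root counters that are dumped and reset each period (illustrative, not actual CatalogWatcher code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Per-catalog-root event counters: record() would be called from the
// change-detection hooks, and dumpAndReset() at the end of each period
// before writing a terse listing to the logs.
public class EventCounters {
    private final ConcurrentHashMap<String, LongAdder> perRoot =
            new ConcurrentHashMap<>();

    public void record(String catalogRoot) {
        perRoot.computeIfAbsent(catalogRoot, r -> new LongAdder()).increment();
    }

    // Return the count for one root and reset it to zero.
    public long dumpAndReset(String catalogRoot) {
        LongAdder a = perRoot.remove(catalogRoot);
        return a == null ? 0 : a.sum();
    }
}
```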
<p>The big problem is that these counters need to be executed on one
of the Unidata production systems (aka motherlode) in order to get
realistic numbers. It is not clear if this is possible.</p>
<p>Note that the measurements would be taken over all the catalog
roots. It is possible that other parts of the file system such as
the GRIB indices would also need to be included.</p>
<h2><a name="Security">Security</a></h2>
<p>The question is: how open should our broker be to arbitrary
clients? The short answer is that it probably should have
the same access controls as the associated Thredds server.
Again, it is not clear what the requirements here should be.</p>
<h2><a name="Summary">Summary</a></h2>
<p>The original motivation for this proposal was to try to mitigate
clients hammering our production servers. The idea is that if
they have access to change events, they do not need to poll our
servers (or at least not quite so often).</p>
<p>Will it work? It depends on several factors:</p>
<ol>
<li>Is it easy for clients to use?</li>
<li>Are they willing to change?</li>
<li>Is it sufficiently reliable that clients will not feel
they are losing events?</li>
</ol>
<h2><a name="AppendixA">Appendix A. Miscellaneous Notes</a></h2>
<h3>Multiple Servers Per Broker</h3>
<p>The above discussion assumes that the server-broker
association is one-to-one. Other arrangements are possible, such
as multiple servers sharing a single broker or a server using
multiple brokers. The relative merits of these alternatives
are unclear to me, but the possibility is worth noting.</p>
<h3>Thredds Persistence</h3>
<p>I could implement durability in the Thredds server
by keeping a queue of changed directories
and allowing clients to ask for everything since &lt;some date&gt;.
In effect, I would be subsuming the broker as part of Thredds.
This is possible, and maybe an acceptable approach.</p>
https://www.unidata.ucar.edu/blogs/developer/entry/the-death-of-server-sideThe Death of Server-Side ComputingDennis Heimbigner 2017-06-05T12:25:45-06:002017-06-05T12:25:45-06:00<p>For a number of years, the Unidata Thredds group has been in the
process of "implementing" server-side computation
<em>Real-Soon-Now</em> (as the saying goes).</p>
<p>Events have overtaken the previous notion of server-side
computing and here we try to codify a replacement that uses a
separate server model based on Jupyter (an offshoot of IPython).</p>
<p>From the point of view of Unidata, Jupyter provides a powerful
alternative to roll-your-own server-side computing. It supports
multiple, "real" programming languages. It is a server itself, so
it can be co-located with an existing Thredds server. And, most
importantly, it is designed to execute small programs written in any
of its supported languages.</p>
<p>We are proposing to implement server-side computing for Thredds by using
one or more co-located Jupyter servers. This document elaborates on the
capabilities and required support infrastructure to make this
proposal operational.</p>
<h2>(OK, that is hyperbole, but...)</h2>
<!--# The Death of Server-Side Computing-->
<p><strong>Initial Draft</strong>: 2017-5-28 <br />
<strong>Last Revised</strong>: 2017-6-5 <br />
<strong>Author</strong>: Dennis Heimbigner, Unidata </p>
<h3>Table of Contents</h3>
<ul>
<li><a href="#Introduction">Introduction</a></li>
<li><a href="#Jupyter">The Alternative: Jupyter</a></li>
<li><a href="#Python">The Language: Python</a></li>
<li><a href="#Architecture">The Notional Architecture</a></li>
<li><a href="#Accessing">Accessing the Jupyter Server</a></li>
<li><a href="#Asynchronous">Asynchronous Operation</a></li>
<li><a href="#Value">Thredds Value Added</a></li>
<li><a href="#Specialization">Specialized Capabilities</a></li>
<li><a href="#Security">Access Controls</a></li>
<li><a href="#Charging">Resource Controls</a></li>
<li><a href="#Plan">Planned Activities</a></li>
<li><a href="#References">References</a></li>
</ul>
<h2><a name="Introduction">Introduction</a></h2>
<p>For a number of years, the Unidata Thredds group has been in the
process of "implementing" server-side computation
<em>Real-Soon-Now</em> (as the saying goes).</p>
<p>Server-side computing embodies the idea that it is most
efficient to physically co-locate a computation with the
datasets on which it is operating. As a rule, this meant having
a server execute the computation because the dataset was
controlled by that server. Server-side computing servers for
the atmospheric community have existed in various forms for a
while now: GRADS, DAP2 servers, and ADDE, for example.</p>
<p>One -- and perhaps <em>The</em> -- major stumbling block to server-side
computing is defining and implementing the programming language
in which the computation is coded. In practice, server-side
systems have developed their own language for this purpose.
This is a problem primarily because it is very difficult to
define and implement a programming language. Often the
"language" started out as some form of constraint expression
(e.g. DAP2, DAP4, and ADDE). Over time, it would accrete other
capabilities: conditionals, loops, etc. In time, it grew into a
badly designed but more complete programming language. Since it
was rarely implemented by language/compiler experts, it usually
was quirky and presented a significant learning curve for users.</p>
<p>The advantage to using such a home grown language was that it
could be tailored to the dataset models supported by the
server. It also allowed for detailed control of programs. This
made certain other issues easier: access controls and resource
controls, for example.</p>
<p>The author recognized the language problem early on and was
reluctant to go down that path. Since he was the primary "pusher" for
server-side computing at Unidata, that reluctance delayed implementation
for an extended period.</p>
<h2><a name="Jupyter">The Alternative: Jupyter</a></h2>
<p>Fortunately, about three years ago, project Jupyter [1]
was created as an offshoot of the IPython Notebook
system. It provided a multi-user, multi-language compute engine
in which small programs could be executed. With the advent
of Jupyter, IPython then refactored its computation part to
use Jupyter.</p>
<p>From the point of view of Unidata, Jupyter provides a powerful
alternative to traditional server-side computing. It supports
multiple, "real" programming languages. It is a server itself, so
it can be co-located with an existing Thredds server. And, most
importantly, it is designed to execute small programs written in any
of its supported languages.</p>
<p>In the rest of this document, the term "program" will, as a
rule, refer to programs executing within a Jupyter server.</p>
<h2><a name="Python">The Language: Python</a></h2>
<p>In order to avoid the roll-your-own language problem, it was decided
to adopt wholesale an existing modern programming language. This meant
that the language was likely to be complete right from the start. Further,
the learning curve would be reduced because a significant amount
of supporting documentation and tutorials would be available.</p>
<p>We have chosen Python as our preferred language. We made this
choice for several reasons.</p>
<ol>
<li>Python is rapidly being adopted by the atmospheric sciences
community as its language of choice.</li>
<li>There is a very active community that is
developing packages for use by the scientific community
and more specifically for the atmospheric sciences community.
Examples are numerous, including numpy, scipy, metpy, and siphon.</li>
<li>It is one of the languages supported by Jupyter.</li>
</ol>
<p>To the extent that Jupyter supports other languages, it would be
possible to write programs in those languages. However, I would not expect
Unidata to expend any significant resources on those other languages.
The one possible exception is if/when Jupyter supports Java.</p>
<h2><a name="Architecture">The Notional Architecture</a></h2>
<p><img src="https://github.com/DennisHeimbigner/Images/raw/master/Jupyter%2BThredds%2BServlet.png" alt="Notional Architecture" style="width: 300px;"/></p>
<p>The notional architecture we now espouse is shown in Figure 1.
Basically, a standard Thredds server runs along side a
Jupyter server. A program executing in the Jupyter server has
access to the data on the Thredds server either using the file
system or using some streaming protocol (e.g. DAP2). File access
is predicated on the assumption that the two servers are
co-located and share a common file system.</p>
<p>The Thredds server currently requires and uses some form of
servlet engine (e.g. Tomcat). We exploit that to provide a
front-end servlet to act as intermediary between a user and the
Jupyter server (see <a href="#Accessing">below</a>).</p>
<p>So now, instead of sending a program to the Thredds server, it
is sent to the Jupyter server for execution. That executing
program is given access to the Thredds server using a variety of
packages (e.g. Siphon [2]). Once its computation is completed,
its resulting products can be published within a catalog on
Thredds to make them accessible to user programs.
Once in the catalog, a product can be accessed by external
clients using existing streaming protocol services. In some cases,
it may also be possible to access that product
using a shared file system.</p>
<p>This discussion assumes the existence of a single Jupyter server,
but it will often be desirable to allow multiple such servers.
Examples of the utility of multiple servers will be discussed
in subsequent sections.</p>
<h2><a name="Accessing">Accessing the Jupyter Server</a></h2>
<p>Access to the Jupyter server will be supported using several
mechanisms. Each mechanism has a specific use case.</p>
<h3>IPython Access</h3>
<p>Though not shown in Figure 1, it is assumed that existing
IPython access to Jupyter is available. This path is, of course,
well documented elsewhere in the IPython+Jupyter literature.</p>
<h3>Web-based Access</h3>
<p>Another use-case is to provide access for scientists with
limited programming skills or for other users requiring
simple and occasional computations.</p>
<p>The servlet box in Figure 1 illustrates this case:
client web browsers carry out forms-based
computations via the front-end servlet running under Apache
Tomcat (or another servlet engine).</p>
<h3>Programmatic Access</h3>
<p>Scientists will still write standalone programs that need to
process computed data. Others will write value-added wrapper
programs to provide, for example, additional capabilities such
as plotting or other graphical presentation.</p>
<p>These use cases will require the ability to upload and execute
programs from client-side programs. The simplest approach here
is to build on the web-based version. That is, the client side
program would also access the servlet, but using a modified
and streamlined interface.</p>
<h2><a name="Asynchronous">Asynchronous Operation</a></h2>
<p>Some computations will take a significant amount of time to
complete. Submitting such a computation through the Thredds
server interface is undesirable because it requires
either blocking of the client for long periods of time or
complicating the Thredds server to make it support
<em>asynchronous</em> execution. The latter usually involves
returning some kind of token (aka <em>future</em>) to the client
that it can interrogate to see if the computation is
complete. Or alternatively, providing some form of server to
client event notification mechanism. In any case, such mechanisms
are complicated to implement.</p>
<p>Direct client to Jupyter communication (see previous <a href="#Accessing">section</a>)
can provide a simple and effective alternative to direct
implementation of asynchronous operation. Specifically, the client
uploads the program via IPython or via a web browser to the
Jupyter server. As part of its operation the program
uploads its final product(s) to some catalog in the Thredds server.
The client is then responsible for detecting that the product
has been uploaded, which then enables further processing of that product
as needed.</p>
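The client-side responsibility described above can be sketched as a small polling helper. The <code>product_available</code> predicate, standing in for "has my product appeared in the Thredds catalog yet", is an assumption of this sketch, as are the parameter names.

```python
import time

def wait_for_product(product_available, timeout=600.0, interval=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until product_available() reports True or the timeout elapses.

    product_available: zero-argument callable, e.g. a check that a dataset
    name now appears in a Thredds catalog (hypothetical in this sketch).
    Returns True if the product appeared, False on timeout.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if product_available():
            return True
        sleep(interval)
    return False
```

Injecting the clock and sleep functions keeps the helper testable; a real client would pass a catalog-lookup closure as the predicate.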
<h2><a name="Value">Thredds Value Added</a></h2>
<p>Given the approach advocated in this document,
on what should Unidata focus to support it?</p>
<h3>Accessing Thredds Data</h3>
<p>First and foremost, we want to make it easy, efficient, and fast
for programs to access the data within a co-located Thredds server.</p>
<p>Thredds currently provides a significant number of "services" [3]
through which metadata and data can be extracted from a Thredds server.
These include at least the following: DAP2 (OpenDAP), DAP4, HTTPServer,
WCS, WMS, NetcdfSubset, CdmRemote, CdmrFeature, ISO, NcML, and UDDC.</p>
<p>The cost of accessing data via commonly supported
protocols, such as DAP2 or CdmRemote, is relatively independent of
co-location, so using such protocols is probably not the most
efficient method.</p>
<h3>File Download</h3>
<p>The most efficient inter-server communication is via a shared file
system accessible both to the Thredds server and the
Jupyter server.</p>
<p>As of Thredds 5 it is possible to materialize both datasets and
(some kinds of) streams as files: typically netcdf-3 (classic)
or netcdf-4 (enhanced). One defines a directory into which
downloads are stored. A special kind of request is made to a
Thredds server that causes the result of the query to be
materialized in the specified directory. The name of the
materialized file is then returned to the client.</p>
<h3>Siphon</h3>
<p>The Siphon project [2,4] is designed to wrap access to a Thredds
server using a variety of Thredds services. As such, it will
feature prominently in our system. Currently, Siphon supports
the reading of catalogs, and data access using the Thredds
netcdf subset service (NCSS), CdmRemote, and radar data.</p>
<h3>Operators</h3>
<p>The raison d'être of server-side computation is to input
datasets, apply operators to them, and produce new product
datasets. In order to simplify this process, it is desirable
to make available many high-level operators so that
a computation can be completed by the composition of operators.</p>
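Composition of operators can be illustrated with plain functions. The two toy operators below (a time mean and an anomaly) are invented for illustration and are not part of any Unidata library; real operators would work on netCDF variables rather than nested lists.

```python
from functools import reduce

def time_mean(series):
    """Average over the leading (time) dimension of a list-of-lists grid."""
    nsteps = len(series)
    return [sum(vals) / nsteps for vals in zip(*series)]

def anomaly(series):
    """Subtract the time mean from every time step."""
    mean = time_mean(series)
    return [[v - m for v, m in zip(step, mean)] for step in series]

def compose(*operators):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), operators)
```

A computation is then just a pipeline, e.g. <code>compose(time_mean, anomaly)</code>, which is the property that makes a vetted library of high-level operators so valuable.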
<p>Often, server-side computation is illustrated using simple
operations such as sum and average. But these kinds of operators
are likely to only have marginal utility; they may be useful,
but will not be the operators doing the heavy lifting of server
side computation.</p>
<p>Accumulating useful operators is possibly another place where
Unidata can provide added value. Unidata can both provide a
common point of access, as well as providing some form of vetting
for these operators.</p>
<p>One example is Pynco [5]. This is a Python wrapping of
the netCDF Operators (NCO) [6]. NCO is currently all command line,
so Pynco wraps it to allow programmatic invocation of the various
operators.</p>
<p>As part of the operator support, Unidata might wish
to create a repository (using conda channels or Github?) to
which others can contribute.</p>
<h3>Publication (File Upload)</h3>
<p>When a program is executed within Jupyter,
it will produce results that need to be communicated to others --
especially the client originating the computation.
The obvious way to do this is to use the existing Thredds
<em>publication</em> facilities, namely catalogs.</p>
<p>As of Thredds 5, it is possible to add a directory to some top-level
catalog. Uploading a file into that directory causes it to appear
in the specified catalog. Uploading can be accomplished either
by file system operations or via a browser forms page.</p>
<h2><a name="Specialization">Specialized Capabilities</a></h2>
<p>Another way to add value is to make libraries available
that support specialized kinds of computations.</p>
<h3>GPU Support</h3>
<p>The power of Graphics Processing Units (GPUs) has significantly
increased over the last few years. Libraries now exist for
performing computations on GPUs. To date, using a GPU on
atmospheric data is uncommon. It should be possible to improve
the situation by making operators available that use a GPU
underneath to carry out the computation.</p>
<h3>Machine Learning Support</h3>
<p>Artificial Intelligence, at least in the form of machine learning,
is another example of a specialized capability. Again,
use of AI to process atmospheric data is currently not common.
It should be possible to build quite sophisticated subsystems
supporting the construction of AI systems for doing predictions
and analyses on such data.</p>
<h2><a name="Security">Access Controls</a></h2>
<p>There is a clear danger in providing a Jupyter server
open to anyone to use. Such a server is a potential
exploitable security hole if it allows the execution of arbitrary
code. Further, there are resource issues when anyone is allowed
to execute a program on the server.</p>
<p>Much of the support for access controls will depend on the
evolving capabilities implemented by the Jupyter project. But
we can identify a number of access controls that will be needed
to protect a Jupyter server.</p>
<h3>Sandboxing</h3>
<p>The most difficult problem is to prevent the execution of
arbitrary code on the Jupyter server. Effectively, such code
must be <em>sandboxed</em> to control what facilities are made
available to executing programs. Two sub-issues arise.</p>
<ol>
<li><p>Some Python packages must be suppressed. Arbitrary file operations
and sub-process execution are two primary points of concern.</p></li>
<li><p>Various packages must be usable: numpy, metpy, siphon, for example.
They are essential for producing the desired computational products.
However, the security of the system depends on the security of those
packages. If they contain exploitable security flaws, then security as a
whole is compromised.</p></li>
</ol>
<h3>Authentication</h3>
<p>Strong authentication mechanisms will need to be implemented
so that only authorized users can utilize the resources of a Jupyter
server.
Jupyter authentication may need to be coordinated with
the Thredds server so that some programs executed on Jupyter
can have access to otherwise protected datasets on the Thredds server.
This is one case where multiple Jupyter servers (and even multiple
Thredds servers) may be needed to support specialized access
to controlled datasets by using isolated Jupyter servers.</p>
<h2><a name="Charging">Resource Control Mechanisms</a></h2>
<p>Uncontrolled execution of code can potentially be a significant
performance problem. Additionally, it can result in significant
costs (in dollars) being charged to the server's owner.</p>
<p>For many situations, it will be desirable to force clients to
stand up their own Jupyter server co-located with some Thredds
server in a way that allows the client to pay for the cost of
the Jupyter server. Cloud computing is the obvious
approach. Clients will pay for their own virtual machine that is
as "close" to the Thredds server as their cloud system will
allow. The client can then use their own Jupyter server on
their own virtual machine to do the necessary computations and
for which they will be charged.</p>
<h2><a name="Plan">Planned Activities</a></h2>
<p>A preliminary demonstration of communication between Thredds and Jupyter
was created by Ryan May under the auspices of the ODSIP grant [7]
funded from the NSF Earthcube program.</p>
<p>We anticipate starting from the ODSIP base demonstration and
extending it over time. Subject to revision, the current plan
involves the following steps.</p>
<h4>Step 1. Servlet-Based Access</h4>
<p>The first step is to build the servlet front-end.
This may be viewed as a stripped-down mimic of IPython.
This servlet will support both forms-based access as well as
programmatic access.</p>
<h4>Step 2. Operators</h4>
<p>An initial set of operator libraries will need to be
collected so that testing, experimentation, and tutoring
can proceed. This will be an ongoing process. One can hope
that some form of repository can be established and that
a critical mass of operators will begin to form.</p>
<h4>Step 3. Configuration</h4>
<p>The next step is to make it possible, though not necessarily easy,
for others to stand up their own Jupyter + Thredds. One approach
would be to create a set of Docker instructions for this purpose.
This would allow others to directly instantiate the Docker container
as well as provide a recipe for non-Docker operation.</p>
<h4>Step 4. Examples</h4>
<p>As with any new system,
external users will have difficulties in using it.
So a variety of meaningful examples will need to be
created to allow at least cutting-edge users to begin
to experiment with the system. Again, this will be
an on-going activity.</p>
<h4>Step 5. Access Controls</h4>
<p>At least an initial access control regime cannot be delayed for very long.
Some external users can live without this in the short term.
But for more widespread use, the users must have some belief in the
security of the systems that they create. As with operators, this will
be an on-going process.</p>
<h4>Step 6. Workshops and Tutorials</h4>
<p>At some point, this approach must be presented to the larger community.
For Unidata, this is usually done using our Workshops. Additionally,
video tutorials and presentations will need to be created.</p>
<h2><a name="References">References</a></h2>
<p>[1] https://en.wikipedia.org/wiki/IPython#Project_Jupyter <br />
[2] https://github.com/Unidata/siphon <br />
[3] https://www.unidata.ucar.edu/software/thredds/v4.3/tds/tds4.3/reference/Services.html <br />
[4] https://unidata.github.io/siphon/ <br />
[5] https://github.com/nco/pynco <br />
[6] http://nco.sourceforge.net/ <br />
[7] https://www.nsf.gov/awardsearch/showAward?AWD_ID=1343761</p>
https://www.unidata.ucar.edu/blogs/developer/entry/proposed-thredds-architecture-changes-forProposed Thredds Architecture Changes for OSGI/JigSawDennis Heimbigner 2017-05-25T15:44:52-06:002017-05-31T17:55:24-06:00<p>This post provides some preliminary ideas on the consequences of moving TDS to use OSGI or JigSaw.</p>
<p>Assumptions:</p>
<ol>
<li>OSGI and Jigsaw will be sufficiently similar so that this proposal will work with either, with some tweaks.</li>
<li>Initial target is Thredds server</li>
<li>We will want to dynamically load at least the following kinds of things on the server.
<ul><li>IOSPs (e.g. netcdf4, grib, etc.)</li>
<li>RAFs (e.g. S3 and HDFS)</li>
<li>Services (e.g. DAP4)</li></ul>
I will refer to all of these generically as "bundles" (OSGI terminology).</li>
</ol>
<p>The loading process could be either:</p>
<ol>
<li>lazy - load only when actually requested</li>
<li>eager - load at startup, extending a skeleton TDS into a specifically configured TDS.</li>
</ol>
<p>For the eager case, we can assume that some config file (e.g. ThreddsConfig.xml) contains the information needed to dynamically extend the tds to make various bundles available.</p>
<p>For the lazy case, it must be possible to create a "signal" that some bundle is needed and must be preloaded. I can see two obvious ways to do this.</p>
<ol>
<li>Stubs -- we provide stub classes for all the bundles so that calling the stub API the first time causes the bundle to be loaded and then used from then on.</li>
<li>Explicit -- any user of a bundle must explicitly invoke some code to load the required bundle.</li>
</ol>
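The stub approach can be sketched in miniature as a proxy that defers loading the real bundle until first use. Python's <code>importlib</code> stands in here for an OSGI/JigSaw class loader; a real TDS stub would of course be Java, and the class name is invented for illustration.

```python
import importlib

class LazyBundle:
    """Stub standing in for a bundle: loads the real module on first use."""

    def __init__(self, module_name):
        self._name = module_name
        self._module = None  # not loaded until first attribute access

    def __getattr__(self, attr):
        # Called only for attributes not found on the stub itself,
        # i.e. calls into the bundle's API.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)
```

The first call through the stub triggers the load; every later call goes straight to the cached implementation, which is exactly the transparency the stub option promises.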
<p>My current inclination is to use the eager approach since it is simpler and still allows us to keep a small-footprint .war file.</p>
<p>Another question is: where are the bundles stored? I assume they are not kept in the .war file since that would defeat one of the purposes of using dynamic loading. I presume there would be some default repository(s) plus a configurable set of additional repositories from which bundles can be pulled. It may be that NEXUS is usable for this purpose.</p>
<p>A note on IOSPs. Currently the IOSP to use is determined by calling a method that looks at a RAF wrapping a file. This method decides if it can process the associated file. If we were to use lazy loading, it is probable that for IOSPs we would need to divide the IOSP into two parts: one for testing applicability and one for processing. This is an argument for using eager loading.</p>
https://www.unidata.ucar.edu/blogs/developer/entry/new-tds-cloud-architectures-proposalNew TDS Cloud Architectures: Proposal 1Dennis Heimbigner 2016-09-15T20:39:30-06:002016-09-30T16:29:11-06:00<p>The Thredds Data server (TDS) was designed to operate in a client-server architecture. Recently, Unidata has moved TDS into the cloud using its existing architecture.</p>
<p>There seems to be agreement inside Unidata that we need to begin rethinking that architecture to adapt to the realities of the cloud.</p>
<p>[First Draft: 9/15/2016]<br />
[Last updated: 9/16/2016]</p>
<h2>Proposal 1</h2>
<p>This (first) proposal makes an assumption about the nature of the cloud, especially as it is likely to be in the near future.</p>
<p>The assumption is that rather than having large quantities of data behind a (TDS) server, all data will be stored in cloud storage such as Amazon S3 or Azure blobs.</p>
<p>Secondarily, in such an environment, TDS cannot be aware of all data, because the set of all data is likely to be growing at a fast rate and by organizations not known to a given TDS server.</p>
<p>In this environment, the role of TDS becomes more that of a locator and transformer of data. That is, TDS must be made aware of some datasets; it then applies various computations to that data to produce new derived data, which it publishes into cloud storage.</p>
<p>Some consequences:</p>
<ul>
<li>Unidata may have to get into the data discovery business, something it has tended to avoid so far.</li>
<li>The new TDS must be organized so that others can extend its capabilities by providing new kinds of computation models.</li>
<li>It is not clear if protocols such as DAP2, DAP4, CdmRemote, etc. will be needed any longer, because clients will be able to access the computed products using the S3 or Blob interfaces. In effect, streaming is replaced by the reification of computations into a file in S3/Blob.</li>
<li>Asynchronous computations more or less fall out of this proposed architecture if it is possible for a client to poll S3/Blob for some dataset or to get an event notification from the cloud.</li>
<li>Standardized file formats now become more important than ever. The primary such formats for the atmospheric sciences are, I believe, netcdf-3 and netcdf-4. The HDF5 format is likely to become more important as well, although its complexity vis-a-vis netcdf-4 will, in my opinion, hold it back.</li>
</ul>
<p>Some questions:</p>
<ul>
<li>Is there room for another (or several) standard file formats?</li>
<li>Is it possible to define a wrapper API for S3, Azure blobs, and whatever Google and other cloud companies provide? Such an API would help clients avoid locking in to a single provider.</li>
<li>What is the relation between this proposal and, say, Amazon lambda, or microservices?</li>
</ul>
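On the wrapper-API question, a minimal provider-neutral interface might look like the sketch below. The three-method surface and the in-memory implementation are assumptions for illustration, not an actual S3/Azure binding; real backends would wrap boto3 or azure-storage-blob behind the same calls.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Hypothetical provider-neutral facade over S3, Azure blobs, etc."""

    @abstractmethod
    def put(self, key, data): ...
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def list(self, prefix=""): ...

class MemoryStore(ObjectStore):
    """In-memory stand-in, useful for tests and for sketching the API."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))
```

Client code written against <code>ObjectStore</code> would not care which cloud vendor sits underneath, which is precisely the lock-in protection the question asks about.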
<p>[9/16/2016]</p>
<h2>Notes on Services to be Provided</h2>
<h3>Catalog</h3>
<p>Our current catalog system assumes that there is some set of datasets over which we have control and knowledge. As a rule, that set is the set of datasets on the Thredds server machine.</p>
<p>Under this proposal, this becomes less true. There may be no such set. Let us propose instead that we provide an umbrella catalog to which others can ask to have their datasets added. Additionally, others might ask to have their catalogs grafted onto our catalog tree. In any case, we are effectively talking about a federated catalog.</p>
<p>The value added is that we become the place to go to locate datasets. A consequence is that it becomes incumbent on us to:</p>
<ol>
<li>Make searching our catalogs easy and support sophisticated searches.</li>
<li>Provide our catalog in a variety of formats, such as in the form of a set of relational tables.</li>
<li>Provide the ability to crack datasets to obtain additional information for our catalogs.</li>
</ol>
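The grafting idea can be sketched with nested dictionaries standing in for catalog trees; the structure and function names are illustrative, not the actual Thredds catalog schema.

```python
def graft(umbrella, path, subcatalog):
    """Attach a contributed subcatalog under the given slash-separated
    path in the umbrella catalog (both plain nested dicts here)."""
    node = umbrella
    parts = path.strip("/").split("/")
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = subcatalog
    return umbrella

def find(catalog, path):
    """Look up a dataset or subcatalog by slash-separated path."""
    node = catalog
    for part in path.strip("/").split("/"):
        node = node[part]
    return node
```

The federation point is that the umbrella owner never copies the contributed datasets; it only holds the graft points, so search and discovery can span catalogs maintained by other organizations.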
<h3>CDM</h3>
<p>We also need to think about the role of CDM in this proposal. Currently, CDM is our UNCOL (a historical reference) in that CDM is the common model that allows us to separate the dataset format from the users of that dataset. That is, an IOSP maps some data format to CDM, and then tools can be defined in terms of CDM to avoid having to know about all the actual data formats. This is a very powerful approach and we should not discard it.</p>
<h3>Subset Services</h3>
<p>Data subsetting services, in the form of NCSS and the DAP2/DAP4 constraint languages, are an additional service we provide that will continue to be important in any new architecture. In fact, I think that pulling them out as a separate set of services would fit well with this architecture. [Needs more thought.]</p>
<p>[More thoughts will be added as they occur to me]</p>
https://www.unidata.ucar.edu/blogs/developer/entry/upload-and-download-support-forUpload and Download Support for TDSDennis Heimbigner 2016-08-23T19:57:54-06:002016-08-31T11:33:46-06:00<p>For version 5.0.0, it is possible to configure TDS to support the uploading and downloading of files into the local file system using the "/thredds/download" url path. This is primarily intended to support local File materialization for server-side computing. The idea is that a component such as <a href="http://jupyter.org">Jupyter</a> can materialize files from TDS to make them available to code being run in Jupyter. Additionally, any final output from the code execution can be uploaded to a specific location in the TDS catalog to make it available externally.</p>
<p>Note that this functionality is not strictly necessary since it could all be done on the client side independent of TDS. It is, however, useful because the client does not need to duplicate code already available on the TDS server. This means that this service provides the following benefits to the client.</p>
<ol>
<li>It is lightweight WRT the client</li>
<li>It is language independent</li>
</ol>
<h2>Assumptions</h2>
<p>The essential assumption for this service is that any external code using this service is running on the same machine as the Thredds server, or at least shares a common file system, so that file system operations by Thredds are visible to the external code.</p>
<p>An additional assumption is that "nested" calls to the Thredds server will not cause a deadlock. This is how access to non-file datasets (e.g. via DAP2 or DAP4 or GRIB or NCML) is accomplished. That is, the download code on the server will do a nested call to the server to obtain the output of the request. Experimentation shows this is not currently a problem.</p>
<h2>Supported File Formats</h2>
<p>Currently the download service supports the creation of files in two formats:</p>
<ol>
<li>Netcdf classic (aka netcdf-3)</li>
<li>Netcdf enhanced (aka netcdf-4)</li>
</ol>
<h2>Download Service Protocol</h2>
<p>A set of query parameters control the operation of this service. Note that all of the query parameter values (but not keys) are assumed to be url-encoded (%xx), so beware. Also, all return values are url-encoded.</p>
<h3>Request and Reply</h3>
<p>Invoking this service is accomplished using a URL pattern like this.</p>
<pre><code>http://host:port/thredds/download/?key=value&key=value&...
</code></pre>
<p>In all cases, the reply value for the invocation will be of this form.</p>
<pre><code>key=value&key=value&...
</code></pre>
<p>The specific keys depend on the invocation.</p>
<h3>Defined Requests</h3>
<p>The primary key is <strong>request</strong>. It indicates what action
is requested of the server.</p>
<p>The set of defined values for the <strong>request</strong> key are as follows.</p>
<ul>
<li><strong>download</strong></li>
<li><strong>inquire</strong></li>
</ul>
<h4>Request Keys Specific to "request=download"</h4>
<ul>
<li><p><strong>format</strong> -- This specifies the format for the returned dataset; two values are currently defined: <strong>netcdf3</strong> and <strong>netcdf4</strong>.</p></li>
<li><p><strong>url</strong> -- This is a thredds server url specifying the actual dataset to be downloaded.</p></li>
<li><p><strong>target</strong> -- This specifies the relative path for the downloaded file. If the file already exists, it will be overwritten. Any leading directories will be created underneath <strong>downloaddir</strong> (see below).</p></li>
</ul>
<h4>Reply Keys Specific to "request=download"</h4>
<ul>
<li><strong>download</strong> -- The absolute path of the downloaded file. In all cases, it will be under the <strong>downloaddir</strong> directory.</li>
</ul>
<h4>Request Keys Specific to "request=inquire"</h4>
<ul>
<li><strong>inquire</strong> -- This specifies a semi-colon separated list of keys whose value is desired. Currently, the only defined key is <strong>downloaddir</strong>, which returns the absolute path of the download directory. All downloaded files will be placed under this directory.</li>
</ul>
<h4>Reply Keys Specific to "request=inquire"</h4>
<ul>
<li><strong>downloaddir</strong> -- The absolute path of the directory under which all downloaded files are placed.</li>
</ul>
<h2>Upload Service Protocol</h2>
<p>File upload is not handled directly by calling the THREDDS server. Rather, it is handled by creating a directory that the THREDDS server scans and makes available at a specific point in the standard catalog.</p>
<h2>Thredds Server Configuration</h2>
<p>In order to activate upload and/or download, one or both of the following Java -D flags must be provided to the THREDDS server.</p>
<ul>
<li><strong>-Dtds.download.dir</strong> -- Specify the absolute path of a directory into which files will be downloaded.</li>
<li><strong>-Dtds.upload.dir</strong> -- Specify the absolute path of a directory into which files may be uploaded.</li>
</ul>
<p>Security concerns (see below) must be addressed when setting the permission on these directories.</p>
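<p>On the server side, the value of a -D flag such as <strong>tds.download.dir</strong> is available via <code>System.getProperty</code>, and the client-supplied relative target must be resolved beneath it without escaping. A minimal sketch of that resolution, under the assumption that normalization plus a prefix check is sufficient (the <code>DownloadDirResolver</code> class is hypothetical, not the actual TDS implementation):</p>

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DownloadDirResolver {
    /**
     * Resolve a client-supplied relative target beneath the configured download
     * directory (e.g. the value of System.getProperty("tds.download.dir")),
     * rejecting any path that would escape it after normalization.
     */
    public static Path resolveTarget(String downloadDir, String target) {
        Path base = Paths.get(downloadDir).toAbsolutePath().normalize();
        Path resolved = base.resolve(target).normalize();
        if (!resolved.startsWith(base)) {
            throw new IllegalArgumentException("target escapes the download directory: " + target);
        }
        return resolved;
    }
}
```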
<p>In order to complete the establishment of an upload directory, the following entry must be added to the <strong>catalog.xml</strong> file for the Thredds server.</p>
<pre><code><datasetScan name="Uploaded Files" ID="upload" location="${tds.upload.dir}" path="upload/">
<metadata inherited="true">
<serviceName>all</serviceName>
<dataType>Station</dataType>
</metadata>
</datasetScan>
</code></pre>
<p>Optionally, if one wants to make the download directory visible, the following
can be added to the same file.</p>
<pre><code><datasetScan name="Downloaded Files" ID="download" location="${tds.download.dir}" path="download/">
<metadata inherited="true">
<serviceName>all</serviceName>
<dataType>Station</dataType>
</metadata>
</datasetScan>
</code></pre>
<h2>Security Issues</h2>
<p>It should be clear that providing upload and download capabilities can introduce security concerns.</p>
<p>The primary issue is that this service causes the THREDDS server to write into user-specified locations in the file system. In order to prevent malicious writing of files, the download directory (specified by tds.download.dir) should be created in a safe place. Typically, this means it should be placed under a directory such as "/tmp" on Linux or an equivalent location on other operating systems.</p>
<p>This directory will be read and written by the user running the THREDDS server, typically "tomcat". The best practice is to create a dedicated user and group and set the download directory's owner and group to those values. The POSIX permissions for that directory should then be "rwxrwx---". Finally, the user "tomcat" should be added to the created group.</p>
<p>Corresponding concerns apply to the upload directory, so its owner, group, and permissions should be set in the same way as for the download directory.</p>
<p>The URL used to specify the dataset to be downloaded also raises security concerns. The URL is tested for two specific patterns to ensure proper behavior.</p>
<ol>
<li>The pattern ".." is disallowed in order to prevent attempts to escape the THREDDS sandbox.</li>
<li>The pattern "/download/" is disallowed in order to prevent an access loop in which a download call attempts to invoke download again.</li>
</ol>
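<p>Both of these checks amount to simple substring tests on the client-supplied URL. A minimal sketch (the <code>DownloadUrlCheck</code> helper is hypothetical, not the actual TDS code):</p>

```java
public class DownloadUrlCheck {
    /** Apply the two pattern checks described above to a client-supplied dataset URL. */
    public static boolean isAllowed(String url) {
        if (url.contains("..")) return false;         // would escape the THREDDS sandbox
        if (url.contains("/download/")) return false; // would loop back into this service
        return true;
    }
}
```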
<p>In order to provide additional sandboxing, the URL provided by the client is modified to ignore the host, port, and servlet prefix. They are replaced with the "&lt;host&gt;:&lt;port&gt;/thredds" of the THREDDS server itself. This prevents attempts to use the THREDDS server to access external data sources, which would otherwise be a security hole.</p>
<p>Finally, it is desirable that some additional access controls be applied. Specifically, Tomcat should be configured to require client-side certificates so that all clients using this service must have access to that certificate.</p>
<h2>Examples</h2>
<h3>Example 1: Download a file (via fileServer protocol)</h3>
<p>request:</p>
<pre><code>http://localhost:8081/thredds/download/?request=download&format=netcdf3&target=nc3/testData.nc3&url=http://host:80/thredds/fileServer/localContent/testData.nc&testinfo=testdirs=d:/git/download/tds/src/test/resources/thredds/server/download/testfiles
</code></pre>
<p>reply:</p>
<pre><code>download=c:/Temp/download/nc3/testData.nc3
</code></pre>
<p>Note: the encoded version of the request:</p>
<pre><code>http://localhost:8081/thredds/download/?request=download&format=netcdf3&target=nc3%2FtestData.nc3&url=http%3A%2F%2Fhost%3A80%2Fthredds%2FfileServer%2FlocalContent%2FtestData.nc&testinfo=testdirs%3Dd%3A%2Fgit%2Fdownload%2Ftds%2Fsrc%2Ftest%2Fresources%2Fthredds%2Fserver%2Fdownload%2Ftestfiles
</code></pre>
<h3>Example 2: Download a DAP2 request as a NetCDF-3 File</h3>
<p>request:</p>
<pre><code>http://localhost:8081/thredds/download/?request=download&format=netcdf3&target=testData.nc3&url=http://host:80/thredds/dodsC/localContent/testData.nc&testinfo=testdirs=d:/git/download/tds/src/test/resources/thredds/server/download/testfiles
</code></pre>
<p>reply:</p>
<pre><code>download=c:/Temp/download/testData.nc3
</code></pre>
<h2><a href="https://www.unidata.ucar.edu/blogs/developer/entry/thredds_and_java_8_plans">THREDDS and Java 8 plans</a></h2>
<p class="byline">
by Sean Arms
<br />2015-05-26
</p>
Netcdf-Java and the TDS version 4.6.1 have been released. This version requires Java 7+. Bug fixes and minor enhancements will continue on the 4.6 branch for six months or so.
<br />
<br />
Development is now switching to version 5.0 which will require Java 8. Version 5 is a major upgrade and some of the APIs will change. Deprecated classes will be moved to a legacy jar and will not be supported. If you are a developer, you will need to test the new version against your code. We expect to have an alpha release out by July for that purpose.
<br />
<br />
Java 7 had its <a href="https://www.java.com/en/download/faq/java_7.xml">final</a> release last month, and is at End of Life (EOL), so security fixes will no longer be applied and pushed to users. If you are running a public server, you must upgrade to <a href="http://www.oracle.com/technetwork/java/javase/downloads/index.html">Java 8</a>. Talk to your sysadmin about getting Java 8 installed on production machines. Educate your security team about this issue if it's not on their radar. Do it now, before it's an emergency.
<br />
<br />
On the desktop, we also recommend that you upgrade to Java 8 now. All known backwards-compatibility issues with THREDDS have been resolved (but if you run into any, please let us know).
<br />
<br />
The THREDDS Team