Building Support for Efficient MetPy Calculations Across Large Datasets with Dask

Editor's Note:
Due to the COVID-19 pandemic, Unidata's 2020 summer interns did not travel to Boulder to work on their projects in person. Instead, they interacted with Unidata developers through Slack, Zoom, and other electronic means.

Russell Manser

Coming into this summer, my goal was to integrate Dask Array support into MetPy. I knew that this was an ambitious task, but I am happy to say that I made progress toward accomplishing it!

Support for Dask Arrays in MetPy will allow efficient calculations on large datasets. Dask is a Python library for parallel computing that splits operations across chunks of a dataset. This makes it possible to work with data that are too large to load into memory on a personal computer, or that are held in distributed memory on a supercomputer. Such datasets can be tens or hundreds of gigabytes on disk. Examples include climate observations and simulations, forecast ensembles, and high-resolution cloud model output.
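To make the chunking idea concrete, here is a minimal sketch of an ensemble mean computed with a Dask Array. The tiny `members` array is a made-up stand-in for a forecast ensemble; a real dataset would normally be opened lazily from disk rather than built in memory.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for a forecast ensemble with dimensions (member, y, x).
members = np.arange(24, dtype=float).reshape(4, 2, 3)

# Split the data into one chunk per ensemble member.
ensemble = da.from_array(members, chunks=(1, 2, 3))

# Building the calculation is lazy: this only records a task graph.
ens_mean = ensemble.mean(axis=0)

# compute() runs the graph chunk by chunk and returns a NumPy array.
result = ens_mean.compute()
```

Nothing is calculated until `compute()` is called, which is what lets Dask schedule the work across chunks instead of loading everything at once.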

Ensemble mean precipitation forecast

To provide support for Dask Arrays, I worked on some of MetPy’s dependencies, such as Pint and Xarray. Specifically, I implemented the Dask collection interface in Pint so Quantity objects can wrap Dask Arrays. I also improved how Xarray handles the Pint Quantities and Dask Arrays that it wraps. When official versions of Pint and Xarray are released with these features included, MetPy can begin preliminary support for Dask Arrays.

I also proposed an automated data type testing suite in MetPy. This was motivated partly by anticipated support for Dask Arrays, but also by bug reports of unexpected behavior with currently supported data types. The existing test suite in MetPy covers all functions, but its coverage of data types is fragmented. The testing suite I proposed automatically builds tests for each function against each data type. When contributors wish to add a new function to MetPy, all they need to do is update a nested dictionary with test data for their function, and the testing suite takes care of the rest!
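The general idea can be sketched in a few lines. Everything here is hypothetical: the two functions, the layout of `TEST_DATA`, and the `build_tests` helper are illustrations only, not the structure proposed for MetPy.

```python
import numpy as np

# Hypothetical functions standing in for MetPy calculations.
def wind_speed(u, v):
    return np.sqrt(u**2 + v**2)

def kelvin_to_celsius(t):
    return t - 273.15

# Hypothetical nested dictionary: one entry per function,
# holding the function, its input data, and the expected result.
TEST_DATA = {
    "wind_speed": {
        "func": wind_speed,
        "args": ([3.0, 6.0], [4.0, 8.0]),
        "expected": [5.0, 10.0],
    },
    "kelvin_to_celsius": {
        "func": kelvin_to_celsius,
        "args": ([273.15, 300.0],),
        "expected": [0.0, 26.85],
    },
}

# Data types to test, each with a converter from plain lists.
DATA_TYPES = {
    "ndarray": np.asarray,
    "masked_array": np.ma.asarray,
}

def build_tests(test_data, data_types):
    """Yield one (test id, callable) pair per function and data type."""
    for func_name, case in test_data.items():
        for type_name, convert in data_types.items():
            def run(case=case, convert=convert):
                args = [convert(a) for a in case["args"]]
                result = case["func"](*args)
                np.testing.assert_allclose(np.asarray(result), case["expected"])
            yield f"{func_name}[{type_name}]", run

# Run every generated test and record the outcome.
results = {}
for test_id, run in build_tests(TEST_DATA, DATA_TYPES):
    try:
        run()
        results[test_id] = "pass"
    except AssertionError:
        results[test_id] = "fail"
```

Adding a function means adding one dictionary entry, and adding a data type means adding one converter. In a real suite, the generated pairs would more likely be handed to a test runner, for example through pytest's parametrization, than looped over by hand.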

Dask array composed of NumPy arrays.

I made progress toward achieving an ambitious goal this summer while learning a great deal about contributing to various open source projects. I am very thankful to have had this experience. Because of it, I feel more prepared to tackle programming problems and confident that I can make meaningful community contributions.

Unidata Developer's Blog
A weblog about software development by Unidata developers*