12 April 2016

The University of Washington’s eScience Institute, a unique environment for geospatial data science education

Posted by nbompey

By Anthony Arendt

This is part of a new series of posts that highlight the importance of Earth and space science data and its contributions to society. Posts in this series showcase data facilities and data scientists; explain how Earth and space science data is collected, managed and used; explore what this data tells us about the planet; and delve into the challenges and issues involved in managing and using data. This series is intended to demystify Earth and space science data, and share how this data shapes our understanding of the world.

The Washington Research Foundation Data Science Studio at the University of Washington. Credit: Anissa Tanweer/ University of Washington.


Earth scientists can choose from an ever-increasing array of datasets when they set out to study our changing planet. Every year, advances in remote sensing and sensor network technologies increase the resolution of our observations and stream data to us on demand, in real time. If you’re like me, you find this new era of discovery exhilarating but also overwhelming. How will I ever find the time to learn the software and cloud technologies needed to keep up with this flow of new information?

The University of Washington’s (UW) eScience Institute was conceived with these challenges in mind. Formed in 2008 with support from the State of Washington and grown with support from the Gordon and Betty Moore Foundation, the Alfred P. Sloan Foundation, and the Washington Research Foundation, the eScience Institute works to develop a new generation of researchers skilled both in their own domain and in the techniques and technologies of data science. A flagship program of the institute has been the data science incubator. This intensive, 10-week program matches data science experts with students and faculty who are trying to solve a data science challenge in their particular domain.

This year, two of our projects had a strong Earth science focus, giving us a unique opportunity to learn which data science challenges the Earth science community faces. Both projects proposed to explore web services, centralized databases and cloud solutions to enable data-intensive discoveries in hydrological sciences. But as the teams set to work, we discovered an early challenge: the datasets themselves lacked the standardization and formatting needed before we could build more sophisticated tools.

Several open-source Python libraries came to the rescue. We used xarray to read and restructure our gridded climate datasets, and pandas to convert the gridded products into tabular series for input to a hydrological model. Both packages offer high-level visualization and data manipulation tools that enabled us to quickly discover errors and inconsistencies in the data. The integration of xarray with Dask, a parallel computing library, allowed us to read high-resolution datasets that were too large to load into memory all at once.
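To give a feel for this workflow, here is a minimal sketch of moving from a gridded product to tabular series with xarray and pandas. It is not the incubator teams' actual code: the variable name `precip`, the coordinates and the values are all made up for illustration, and a small synthetic grid stands in for a real climate dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a gridded climate product (all values are made up).
times = pd.date_range("2015-01-01", periods=4, freq="D")
lats = np.array([47.0, 47.5])
lons = np.array([-122.5, -122.0, -121.5])
precip = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)

ds = xr.Dataset(
    {"precip": (("time", "lat", "lon"), precip)},
    coords={"time": times, "lat": lats, "lon": lons},
)

# Pull the time series at the grid cell nearest a hypothetical station.
cell = ds["precip"].sel(lat=47.4, lon=-122.1, method="nearest")
series = cell.to_series()  # pandas Series indexed by time

# Flatten the whole grid into a table with one row per (time, lat, lon).
table = ds["precip"].to_dataframe().reset_index()

# A quick consistency check of the kind that helps surface gaps in the data.
missing = int(table["precip"].isna().sum())
```

For files too large to fit in memory, passing a `chunks` argument to `xr.open_dataset` returns Dask-backed arrays that are only read when a computation needs them, so the same selection and conversion steps can be applied piece by piece.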

After developing these software tools locally, we deployed them using cloud-based storage accounts and virtual machines. We used cloud compute time generously provided by both Microsoft’s Azure for Research Program and Amazon Web Services. With these tools we could manipulate the full extent of our data and test multiple implementations of our models.

An urgent challenge in Earth science studies is distributing our datasets and model output to the people who can use them the most. To this end our teams worked to build a series of Application Programming Interface (API) web services. The concept of APIs is simple: a user makes a request for a product such as a dataset or plot through a web call. The API software then acts as “middleware”, linking backend data mining tools, located on a single centralized cloud database, with frontend web visualizations.
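The request-and-response idea can be sketched in a few lines of Python. This is only an illustration, not the services our teams built: the `/timeseries` endpoint, the `basin` parameter and the streamflow numbers are hypothetical, and a small dictionary stands in for the centralized cloud database.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical in-memory stand-in for the centralized cloud database.
STREAMFLOW = {
    "basin_a": [12.1, 14.3, 13.8],
    "basin_b": [3.2, 2.9, 3.5],
}


def handle_request(url):
    """Return (status_code, json_body) for a request URL."""
    parsed = urlparse(url)
    if parsed.path != "/timeseries":
        return 404, json.dumps({"error": "unknown endpoint"})

    # parse_qs maps each query parameter to a list of values.
    params = parse_qs(parsed.query)
    basin = params.get("basin", [None])[0]
    if basin not in STREAMFLOW:
        return 400, json.dumps({"error": "unknown or missing basin"})

    # The "middleware" step: look up backend data and return it as JSON.
    return 200, json.dumps({"basin": basin, "values": STREAMFLOW[basin]})


status, body = handle_request("/timeseries?basin=basin_a")
```

A frontend visualization would make the same kind of call over HTTP and plot the returned values. In a real deployment, a web framework would route the request and the lookup would query the database rather than a dictionary.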

APIs have the potential to revolutionize how different scientific teams can share their data and results in a consistent, standardized fashion. When combined with hydrological modeling tools deployed in the cloud, we can begin to bring model results more immediately to local stakeholders. All that is required is an internet connection.

We learned much from this year’s incubator program. The Earth science community would benefit from efforts to standardize the format of datasets and model output. This will help minimize the reinvention of data ingest software each time a team decides to work with a particular product. We strongly encourage the sharing of software tools in an open source environment and the use of non-proprietary libraries. There is enormous emerging potential for offering the results of our data analysis and modeling efforts to the broader public through the development of web services and visualization tools. Other cloud computing solutions, such as the use of virtual machines for running hydrological model simulations, come at a cost that must be balanced against purchasing one’s own server hardware.

In the future, the eScience Institute hopes to continue offering education and advice to help researchers navigate the challenges of big data in the Earth sciences. For more information, and to apply to an upcoming incubator program, contact us!

— Anthony Arendt holds a joint appointment at the University of Washington as a Research Data Science Fellow (eScience Institute) and a Senior Research Scientist (Applied Physics Lab).