Earth Observation Scientific Workflows in a Distributed Computing Environment

Session Type: 
Academic Session
Presenter(s): 
Dr Terence L van Zyl

 

Scientific workflows offer a promising paradigm for supporting researchers in the earth observation domain across many aspects of the scientific process. One such aspect is access to distributed computing, earth observation data and processing resources. Earth observation research often utilises large datasets that require extensive CPU and memory resources to process. These resource-intensive processes can be chained; the sequence of processes (and their provenance) makes up a scientific workflow. Despite the exponential growth in the capacity of desktop computing, the resources available on such devices are often insufficient for the scientific workflow processing tasks at hand. By integrating distributed computing capabilities into a geospatially enabled scientific workflow environment, it is possible to provide researchers with a mechanism to overcome the limitations of the desktop computer. Most efforts to extend scientific workflows with distributed computing capabilities have focused on the web services approach, as exemplified by the OGC's Web Processing Service, and on grid computing. The approach to leveraging distributed computing resources described in this paper instead uses object remoting via RPyC and the dynamic properties of the Python programming language. The VisTrails (http://www.vistrails.org) environment has been extended to allow for geospatial processing through the EO4Vistrails package (http://code.google.com/p/eo4vistrails/). To allow these geospatial processes to be executed seamlessly on distributed resources such as cloud computing nodes, the VisTrails environment has been extended with both multi-tasking and distributed processing capabilities. The extensions include remote execution of PostGIS queries, seamless remoting of NumPy, and integration with NetworkX and PySAL on distributed computing resources.
The paper describes the broader architecture, the lessons learnt, and the strengths and weaknesses of the alternative approaches. These lessons include the limitations of inter-process communication and the implications of the C programming language underpinning the performance of many FOSS4G software packages. The paper concludes by describing the future efforts and extensions needed to improve the current solution.
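
As a rough illustration of the idea of a workflow as a chain of processes with recorded provenance, the following Python sketch may help. It is not the VisTrails or EO4Vistrails API and does not use RPyC: the names Workflow, then, run, scale, threshold and run_example are hypothetical, and a local thread pool from the standard library stands in for a distributed resource purely so that the example is self-contained.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Workflow:
    """Hypothetical chain of processing steps plus a provenance record."""
    steps: List[Callable[[Any], Any]] = field(default_factory=list)
    provenance: List[str] = field(default_factory=list)

    def then(self, step: Callable[[Any], Any]) -> "Workflow":
        # Append a step and return self so that steps can be chained.
        self.steps.append(step)
        return self

    def run(self, data: Any, executor: Executor) -> Any:
        for step in self.steps:
            # Each step is handed to the executor. Here that is a local
            # thread pool; this is an assumption standing in for a remote
            # or cloud resource.
            data = executor.submit(step, data).result()
            # Record which step produced the current intermediate result.
            self.provenance.append(step.__name__)
        return data


def scale(values: List[float]) -> List[float]:
    # Toy processing step: halve every value.
    return [v * 0.5 for v in values]


def threshold(values: List[float]) -> List[float]:
    # Toy processing step: keep values strictly greater than 1.0.
    return [v for v in values if v > 1.0]


def run_example():
    workflow = Workflow().then(scale).then(threshold)
    with ThreadPoolExecutor(max_workers=2) as pool:
        result = workflow.run([1, 2, 3, 4], pool)
    return result, workflow.provenance
```

Because run depends only on the executor's submit interface, a different backend could in principle be substituted without changing the workflow definition, which is the separation between workflow description and execution resource that the abstract alludes to.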

 

Speaker Bio: 

Dr van Zyl is a senior researcher at CSIR, where he performs research into spatio-temporal data analytics.
