Fault Tolerance and Reliability in Scientific Workflows

Date

2005-05-31

Abstract

The emerging technologies of web services, agents and service-oriented workflows will enable scientific projects and experiments to be conducted on a larger scale than ever before. The data used and produced in such projects and experiments are becoming increasingly complex and heterogeneous. This creates a need for a tool (or a set of tools) to efficiently design, manage and maintain problem-solving flows (scientific workflows) built from various components. The DOE Scientific Data Management (SDM) initiative aims to develop a framework that helps scientists manage data in distributed and collaborative environments. It also provides tools that help them create and manage scientific workflows that use network-based (web) services, agent technologies and semantic mediation techniques. The current SDM framework is known as SPA/Kepler and is based on Ptolemy II. One vulnerability of service-dependent workflows is that they require the web services they use to be available whenever the workflow is run. If key web services are not available, the workflow cannot finish successfully. At that point a scientist using such a service would have to wait for it to be restored. This, of course, reduces workflow reliability and availability, and may be sufficient for an end-user to stop using workflows that depend on those services. The work reported here uses the SPA/Kepler framework to explore the issue of reliability of service-based scientific workflows. For example, a workflow that invokes 3 services in series may have an unacceptably high overall failure probability. This thesis explores the issues related to improving overall workflow reliability using fault tolerance. Specifically, the work focuses on failure-masking and fail-over through redundancy, in the context of individual services, rather than on the provision of checkpointing and recovery.
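To illustrate the series example, the following minimal sketch computes the overall failure probability of a chain of services. It assumes that services fail independently and that each has a known availability. The 0.95 figure and the function name are hypothetical and are not taken from the thesis.

```python
def series_failure_probability(availabilities):
    """Probability that at least one service in a series chain is unavailable.

    Assumes independent failures; each entry is the probability that the
    corresponding service is available when the workflow invokes it.
    """
    success = 1.0
    for availability in availabilities:
        success *= availability  # every service in the chain must be up
    return 1.0 - success


# Hypothetical figure: three services, each available 95% of the time.
p_fail = series_failure_probability([0.95, 0.95, 0.95])
print(f"overall failure probability: {p_fail:.4f}")
```

Under these assumed numbers, a workflow built from three individually reliable services fails roughly one run in seven, which is the kind of compounding the abstract refers to.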
Analyses show that even a relatively simple redundancy-based fault-tolerance approach, such as duplication of key services, can improve reliability by an order of magnitude or more. In the context of an actual implementation, one option is to find the locations of alternative (functionally equivalent) services during workflow design, and then use that information at run-time if the primary service fails. A more practical method is to publish the list of services used by the workflow to a UDDI-type service and to dynamically match needed services with functionally equivalent ones if a fail-over is required. A prototype of the latter approach, based on a commercially available brokering service, was developed for one of the SDM pilot workflows to show its viability, and is discussed in detail.
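The first option (a design-time list of alternatives used at run-time) can be sketched roughly as follows. This is an illustration only, not the SPA/Kepler prototype: the function names, the endpoint strings and the `invoke` callable are hypothetical, and the reliability figure again assumes independent failures.

```python
def duplicated_failure_probability(availability, copies=2):
    """Probability that all redundant copies of a service are down at once.

    Assumes the copies fail independently of each other.
    """
    return (1.0 - availability) ** copies


def call_with_failover(endpoints, invoke):
    """Try functionally equivalent endpoints in order; return the first result.

    `endpoints` is the ordered list gathered at design time (primary first).
    `invoke` is any callable that contacts one endpoint and raises on failure.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return invoke(endpoint)
        except Exception as exc:  # treat any error as a service failure
            errors.append((endpoint, exc))
    raise RuntimeError(f"all {len(endpoints)} endpoints failed: {errors}")


# Hypothetical stand-in for a web service call: the primary is down.
def fake_invoke(endpoint):
    if endpoint == "primary":
        raise ConnectionError("primary unavailable")
    return f"result from {endpoint}"


print(call_with_failover(["primary", "backup"], fake_invoke))
print(duplicated_failure_probability(0.95))
```

With an assumed availability of 0.95, duplicating a service lowers its failure probability from 0.05 to about 0.0025, a twentyfold reduction, which is in line with the order-of-magnitude claim. Looking up alternatives dynamically through a registry would replace the fixed `endpoints` list with a query made at fail-over time.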

Keywords

Fault Tolerance, Reliability, Scientific Workflows, Web Services

Degree

MS

Discipline

Computer Science

Collections