Fermilab Computing Division

CS Document 2988-v0

CHEP09 - LQCD Workflow Execution Framework: Models, Provenance, and Fault-Tolerance

Submitted by:
Luciano Piccoli
Updated by:
Luciano Piccoli
Document Created:
12 Nov 2008, 10:43
Contents Revised:
12 Nov 2008, 10:43
Metadata Revised:
12 Nov 2008, 10:44
Viewable by:
  • Public document

Large computing clusters used for scientific processing suffer from systemic failures when operated continuously over long periods to execute workflows. Diagnosing job problems and the faults that eventually lead to failures in this complex environment is difficult, especially when a single job failure can affect the success of the whole workflow.

In this paper, we introduce a model-based, hierarchical, reliable execution framework that encompasses workflow specification, data provenance, execution tracking, and online monitoring of each workflow task; tasks are also referred to as participants. The sequence of participants is described in an abstract, parameterized view, which is translated into a concrete, data-dependency-based sequence of participants with defined arguments.
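The translation from an abstract, parameterized view to a concrete, dependency-ordered sequence can be illustrated with a minimal Python sketch. It is not the framework's own implementation: the class `AbstractParticipant`, the function `concretize`, the `$cfg` parameter, and the file-name templates are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from string import Template


@dataclass(frozen=True)
class AbstractParticipant:
    """Hypothetical abstract participant: every field is a template string."""
    name: str
    inputs: tuple   # templates for the files the participant consumes
    outputs: tuple  # templates for the files the participant produces


def concretize(abstract, params):
    """Substitute parameters, then order participants by data dependency."""
    def fill(template):
        return Template(template).substitute(params)

    concrete = [
        (fill(p.name), [fill(i) for i in p.inputs], [fill(o) for o in p.outputs])
        for p in abstract
    ]
    # Map each produced file to the participant that produces it.
    producer = {out: name for name, _, outs in concrete for out in outs}
    # A participant depends on whichever participant produces its inputs.
    graph = {
        name: {producer[i] for i in ins if i in producer}
        for name, ins, _ in concrete
    }
    return list(TopologicalSorter(graph).static_order())


# Illustrative two-step workflow, deliberately listed out of order.
ABSTRACT = [
    AbstractParticipant("analyze_$cfg", ("prop_$cfg",), ("result_$cfg",)),
    AbstractParticipant("propagate_$cfg", ("config_$cfg",), ("prop_$cfg",)),
]
ORDER = concretize(ABSTRACT, {"cfg": 42})
```

Here the ordering is derived only from which participant produces the files another one consumes, so the order in which the abstract participants are listed does not matter.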

As participants belonging to a workflow are mapped onto machines and executed, periodic and on-demand monitoring of vital health parameters on the allocated nodes is enabled according to pre-specified rules. These rules specify conditions that must hold before, during, and after execution.
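One way to picture such rules is as predicates over monitored values, each tagged with the execution phase in which it must hold. The following Python sketch is only an illustration under stated assumptions: the `Rule` class, the `violations` function, the metric names, and the thresholds are hypothetical and are not taken from the framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical monitoring rule bound to one execution phase."""
    phase: str                                # "pre", "during", or "post"
    description: str
    check: Callable[[Dict[str, float]], bool]  # True when the condition holds


def violations(rules: List[Rule], phase: str, metrics: Dict[str, float]) -> List[str]:
    """Return the descriptions of all rules of this phase that do not hold."""
    return [
        r.description
        for r in rules
        if r.phase == phase and not r.check(metrics)
    ]


# Illustrative rules; metric names and thresholds are assumptions.
RULES = [
    Rule("pre", "enough scratch space", lambda m: m["free_disk_gb"] >= 10),
    Rule("during", "CPU temperature in range", lambda m: m["cpu_temp_c"] < 85),
    Rule("post", "participant exited cleanly", lambda m: m["exit_code"] == 0),
]

PRE_FAILURES = violations(RULES, "pre", {"free_disk_gb": 5})
DURING_FAILURES = violations(RULES, "during", {"cpu_temp_c": 60})
```

Keeping the phase as data on the rule lets the same evaluation function serve periodic checks during execution and on-demand checks before and after it.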

Monitoring information for each participant is propagated upwards through the reflex and healing architecture, which consists of a hierarchical network of decentralized fault management entities called reflex engines. They are instantiated as state machines or timed automata that change state and initiate reflexive mitigation actions when certain faults occur.
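A reflex engine of this kind can be sketched as a small state machine that either handles a fault locally or passes it to its parent in the hierarchy. The Python sketch below is an assumption-laden illustration, not the authors' design: the `ReflexEngine` class, the state names, the event names, and the mitigation actions are all hypothetical, and timing constraints (as in timed automata) are left out.

```python
class ReflexEngine:
    """Hypothetical reflex engine: a state machine with an optional parent."""

    def __init__(self, name, transitions, parent=None):
        self.name = name
        # Maps (state, event) to (next_state, mitigation action or None).
        self.transitions = transitions
        self.parent = parent
        self.state = "healthy"
        self.actions = []  # mitigation actions initiated so far

    def handle(self, event):
        """Handle the event locally if possible, otherwise escalate upwards."""
        key = (self.state, event)
        if key in self.transitions:
            self.state, action = self.transitions[key]
            if action is not None:
                self.actions.append(action)
            return True
        if self.parent is not None:
            return self.parent.handle(event)
        return False  # no engine in the hierarchy knows this event


# Illustrative two-level hierarchy; events and actions are assumptions.
cluster = ReflexEngine(
    "cluster",
    {("healthy", "node_unreachable"): ("degraded", "reschedule_jobs")},
)
node = ReflexEngine(
    "node",
    {
        ("healthy", "disk_full"): ("mitigating", "clean_scratch"),
        ("mitigating", "recovered"): ("healthy", None),
    },
    parent=cluster,
)

node.handle("disk_full")         # handled locally by the node engine
node.handle("node_unreachable")  # unknown to the node, escalated to the cluster
```

In this sketch a fault is mitigated at the lowest level that has a matching transition, and only unhandled events travel upwards.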

We describe how this cluster reliability framework is combined with the workflow execution framework using formal rules and actions specified in first-order predicate logic. This enables a dynamic management design that reduces manual administrative workload and increases cluster productivity. Preliminary results on a virtual setup with injected failures are shown.

Files in Document:
CHEP09 LQCD workflow
Associated with Events:
CHEP 2009 held from 21 Mar 2009 to 27 Mar 2009 in Prague, Czech Republic