Fermilab Computing Division

Sharing Computational Resources on the Grid: How to do it Reliably and Unselfishly


Full Title: Sharing Computational Resources on the Grid: How to do it Reliably and Unselfishly
Date & Time: 06 Jan 2011 at 14:00
Event Location: FCC1 Conference Room
Event Info: Speaker: Saurabh Bagchi, Associate Professor in the School of Electrical and Computer Engineering and the Department of Computer Science, Purdue University


A Fine-Grained Cycle Sharing (FGCS) system aims to use the large amount of idle computational resources available on the Internet. In such a system, PC owners voluntarily make their CPU cycles available as part of a shared computing environment. Popular FGCS systems in use today include Climateprediction.net and the World Community Grid, and the computing infrastructure is provided through widely used open-source codebases such as BOINC (Berkeley Open Infrastructure for Network Computing). Users make their computing resources available under the condition (implicit or explicit) that their own processes (host processes) incur no significant slowdown from letting a foreign job (guest process) run on their machines. To exploit available idle cycles under this restriction, an FGCS system allows a guest process to run concurrently with the host processes, but with resource fencing for the guest process. For guest users, these free computational resources come at the cost of fluctuating availability due to “failures”. Here we define a failure as either the eviction of a guest process from a machine because of resource contention, or a conventional hardware or software failure of a machine.

The primary victims of such resource volatility are large compute-bound guest programs, whose completion times fluctuate widely as a result. Most of these programs are either sequential or composed of several coarse-grained tasks with little communication between them. To achieve high performance in the presence of resource volatility, checkpointing and rollback have been widely applied. These techniques enable an application to periodically save a checkpoint (a snapshot of the application’s state) to stable storage that is connected to the computation node(s) through a network. A job may be evicted from its execution machine at any time and can recover from this failure by rolling back to the latest checkpoint.
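The checkpoint-and-rollback cycle described above can be illustrated with a minimal Python sketch. It is an assumption-laden toy, not the mechanism used by Condor or by the speaker's system: the job (summing squares), the function names `save_checkpoint`, `load_checkpoint`, and `run_job`, the `Evicted` exception, and the use of a local file in place of networked stable storage are all hypothetical choices made for illustration.

```python
import os
import pickle
import tempfile


class Evicted(Exception):
    """Hypothetical signal that the guest job was evicted from its machine."""


def save_checkpoint(path, state):
    """Write the state atomically, so an eviction in the middle of a write
    cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the latest saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, n_steps, interval, evict_at=None):
    """Sum the squares of 0..n_steps-1, checkpointing every `interval` steps.

    `evict_at` simulates an eviction just before that step is executed.
    On (re)start the job rolls back to the latest checkpoint, if any.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if evict_at is not None and state["step"] == evict_at:
            raise Evicted(f"evicted before step {evict_at}")
        state["total"] += state["step"] ** 2
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

In this sketch, a first call such as `run_job(path, 50, 10, evict_at=25)` raises `Evicted` after checkpoints have been written at steps 10 and 20. A second call, `run_job(path, 50, 10)`, resumes from step 20 instead of step 0, so only the five steps computed after the last checkpoint are repeated. The checkpoint interval trades checkpointing overhead against the amount of work lost per eviction.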

Most production FGCS systems, such as Condor, store checkpoints on dedicated storage servers. These servers are few in number, well provisioned, and maintained to achieve 24×7 availability. This solution works well when a cluster belongs only to a small administrative domain or when there are a large number of storage servers. However, it does not scale well with the growing sizes of grids. We have therefore designed a system that stores checkpoints on the shared nodes while handling the volatility that arises when checkpoint nodes become unavailable. Further, we have found that the checkpoints of several prominent applications grow to hundreds of megabytes and beyond, and storing them on shared resources places a burden on those resources. Therefore, we have developed a technique to collate checkpoints from multiple nodes and compress them, reducing the size of the checkpoints that must be stored.
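The abstract does not describe how the collation and compression are done, so the following Python sketch only illustrates the general idea under stated assumptions: checkpoints from several nodes running the same application may share content, and compressing them together can exploit that redundancy where compressing each one separately cannot. The framing format, the function names (`collate`, `split`, `compress_collated`, `decompress_collated`, `demo_sizes`), and the synthetic data are hypothetical; `zlib` stands in for whatever compressor the actual system uses.

```python
import random
import struct
import zlib

_HEADER = struct.Struct("!HI")  # name length (2 bytes), data length (4 bytes)


def collate(checkpoints):
    """Pack a {node_id: bytes} mapping into one length-prefixed byte string."""
    parts = []
    for node_id in sorted(checkpoints):
        name = node_id.encode("utf-8")
        data = checkpoints[node_id]
        parts.append(_HEADER.pack(len(name), len(data)))
        parts.append(name)
        parts.append(data)
    return b"".join(parts)


def split(blob):
    """Inverse of collate(): recover the {node_id: bytes} mapping."""
    checkpoints = {}
    offset = 0
    while offset < len(blob):
        name_len, data_len = _HEADER.unpack_from(blob, offset)
        offset += _HEADER.size
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        checkpoints[name] = blob[offset:offset + data_len]
        offset += data_len
    return checkpoints


def compress_collated(checkpoints):
    """Collate all checkpoints, then compress them as a single stream."""
    return zlib.compress(collate(checkpoints), 9)


def decompress_collated(blob):
    """Decompress a collated stream back into per-node checkpoints."""
    return split(zlib.decompress(blob))


def demo_sizes():
    """Compare separate and collated compression on synthetic checkpoints.

    Each synthetic checkpoint is 4096 bytes of state shared by all nodes
    (random, so incompressible on its own) plus 64 node-specific bytes.
    """
    rng = random.Random(0)
    shared = bytes(rng.getrandbits(8) for _ in range(4096))
    checkpoints = {
        f"node{i}": shared + bytes(rng.getrandbits(8) for _ in range(64))
        for i in range(4)
    }
    separate = sum(len(zlib.compress(c, 9)) for c in checkpoints.values())
    together = len(compress_collated(checkpoints))
    return separate, together
```

Calling `demo_sizes()` returns the total size when each checkpoint is compressed separately and the size when they are collated first; on this synthetic data the collated stream is smaller because the shared state is stored only once. The sketch works only because the toy checkpoints fit inside zlib's 32 KB window. Checkpoints of hundreds of megabytes would need a different way of finding cross-node redundancy, which this example does not attempt.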

In this talk, I will discuss the design principles we followed, the empirical evaluation we performed on Purdue’s campus-wide Condor grid, and the insights we gained from it.

Speaker Bio:

Saurabh Bagchi is an Associate Professor in the School of Electrical and Computer Engineering and the Department of Computer Science at Purdue University in West Lafayette, Indiana. He is a senior member of IEEE and ACM, a "Teaching for Tomorrow" faculty fellow at Purdue University, and the Assistant Director of the CERIAS security center at Purdue. He is the PC chair for the IEEE/IFIP International Symposium on Dependable Systems and Networks (DSN) in 2011. He received his MS and PhD degrees from the University of Illinois, Urbana-Champaign, in 1998 and 2001, respectively. At Purdue, he leads the Dependable Computing Systems Laboratory (DCSL), where he and a set of wildly enthusiastic students try to make and break distributed systems for the good of the world.


Other documents for this event

CS-doc-#: 4212-v2
Title: Sharing Computational Resources on the Grid: How to do it Reliably and Unselfishly
Author(s): Saurabh Bagchi
Topic(s): Computing Techniques Seminars
Last Updated: 14 Jan 2011
