Fermilab Computing Division

CS Document 2262-v1

Tools and Techniques for Managing Clusters for SciDAC Lattice QCD at Fermilab

Document #:
CS-doc-2262-v1
Document type:
Presentation
Submitted by:
Selitha Raja
Updated by:
Selitha Raja
Document Created:
27 Jun 2007, 13:32
Contents Revised:
27 Jun 2007, 13:32
Metadata Revised:
27 Jun 2007, 13:32
Viewable by:
  • Public document

Abstract:
Fermilab operates several clusters for lattice gauge computing, including an 80-node Pentium III cluster, an 80-node Xeon cluster, and a 128-node Xeon cluster. Minimal manpower is available to manage these clusters, so we have written a number of tools and developed techniques to cope with this task. We will describe our tools, which use the IPMI facilities of our systems for hardware management tasks such as remote power control, remote system resets, and health monitoring (temperatures, fan speeds, voltages). We will also discuss our techniques that use network booting to install and upgrade the operating system on these computers. Similar network booting techniques are used to reload BIOS and other firmware. Finally, we will discuss our tools for parallel command processing and file copying, as well as their use in monitoring and administering the PBS batch queue system used on our clusters.
- Simon Epsteyn, Don
Associated with Events:
CHEP2003 held on 24 Mar 2003 in La Jolla, California