Tools and Techniques for Managing Clusters for SciDAC Lattice QCD at Fermilab

Fermilab operates several clusters for lattice gauge computing, including an 80-node Pentium III cluster, an 80-node Xeon cluster, and a 128-node Xeon cluster. Minimal manpower is available to manage these clusters, so we have written a number of tools and developed techniques to cope with this task. We describe our tools, which use the IPMI facilities of our systems for hardware management tasks such as remote power control, remote system resets, and health monitoring (temperatures, fan speeds, voltages). We also discuss our network booting techniques for installing and upgrading the operating system on these computers; similar techniques are used to reload BIOS and other firmware. Finally, we discuss our tools for parallel command processing and file copying, as well as their use in monitoring and administering the PBS batch queue system used on our clusters.
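
To illustrate the idea of parallel command processing mentioned above, here is a minimal sketch in Python. It is not the tool described in this paper: the function name `run_on_nodes`, the `make_argv` hook, and the default of reaching each node through password-less `ssh` are all assumptions made for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_argv(node, command):
    # Assumption: each node is reachable by password-less ssh.
    return ["ssh", node, command]


def run_on_nodes(nodes, command, make_argv=ssh_argv, max_workers=16, timeout=30):
    """Run `command` for every node concurrently.

    Returns a dict mapping node name to (returncode, stdout). A node whose
    command could not be started or timed out maps to (None, error message).
    """

    def run_one(node):
        try:
            proc = subprocess.run(
                make_argv(node, command),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            return node, (proc.returncode, proc.stdout.strip())
        except (subprocess.TimeoutExpired, OSError) as exc:
            return node, (None, str(exc))

    # Threads are sufficient here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, nodes))
```

A call such as `run_on_nodes(["node01", "node02"], "uptime")` (hypothetical host names) would collect one result per node, so unreachable or slow nodes show up as entries with a `None` return code rather than stalling the whole sweep.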
CHEP2003 held on 24 Mar 2003 in La Jolla, California
