Selitha Raja
21 Jun 2007, 16:04
  • Public document
A Distributed Monitoring System (NGOP) that scales for Run II computing has been developed at Fermilab. It provides active monitoring of software and hardware, customizable service-level reporting, early error detection, and problem prevention. NGOP provides persistent storage of collected data and is capable of executing corrective actions and sending notifications. NGOP is a framework for developing Monitoring Agents that track the overall state of computers and the software running on them. Several Monitoring Agents are available within NGOP; they are capable of analyzing log files and checking the existence of system daemons, CPU and memory utilization, availability of web pages, etc. NGOP currently monitors about 1500 nodes and 35000 objects. NGOP has proved to be a useful tool: multiple problems such as node resets, offline CPUs, hard drive errors, NFS problems, and dead system daemons have been detected. NGOP has provided system administrators with the information required for better system tuning and configuration. The NGOP architecture and the current state of deployment will be presented.
CHEP2003 held on 24 Mar 2003 in La Jolla, California
