Fermilab Computing Division

CS Document 2190-v0

Fermilab Distributed Monitoring System (NGOP)

Document #:
CS-doc-2190-v0
Document type:
Conference
Submitted by:
Selitha Raja
Updated by:
Selitha Raja
Document Created:
14 Jun 2007, 16:16
Contents Revised:
14 Jun 2007, 16:16
Metadata Revised:
14 Jun 2007, 16:16
Viewable by:
  • Public document

Abstract:
A Distributed Monitoring System (NGOP) designed to scale to the anticipated requirements of Run II computing has been under development at Fermilab. NGOP provides a framework for creating Monitoring Agents that monitor the overall state of computers and the software running on them. Several Monitoring Agents are available within NGOP; they can analyze log files and check node resets and reachability, the existence of system daemons, CPU and memory utilization, file system existence and size, baseboard temperature, fan speeds, etc. NGOP also provides customizable, hierarchical graphical representations of the monitored systems. NGOP can generate events when serious problems have occurred and raise alarms when potential problems are detected. It can perform corrective actions or send notifications, and it provides persistent storage for collected events, alarms, and actions.

A first implementation of NGOP was recently deployed at Fermilab. It is a fully functional prototype that satisfies most of the existing requirements. It is written primarily in Python (with some modules written in C) and uses XML for all configuration descriptions. The NGOP prototype currently monitors 4 Linux farms consisting of 300 nodes. During its first few months of running, NGOP has proved to be a useful tool. Multiple problems, such as node resets, offline CPUs, and dead system daemons, have been detected. The alarms raised in cases of high swap/memory utilization, disk errors, and NFS timeouts allowed preventive measures to be taken and gave system administrators the information required for better system tuning and configuration.

The current state of deployment and future steps to improve the prototype and implement new features will be presented.
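
As an illustration only, the idea of an XML-configured agent that separates serious problems (events) from potential problems (alarms) might be sketched as follows. This is not NGOP's code or configuration schema: the element and attribute names, the metric names, the thresholds, and the functions `load_rules` and `evaluate` are all assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical configuration. The element and attribute names are
# illustrative assumptions, not NGOP's actual XML schema.
CONFIG = """
<agent name="node_health">
  <rule metric="swap_used_pct" warn="80" error="95"/>
  <rule metric="baseboard_temp_c" warn="60" error="75"/>
</agent>
"""


@dataclass
class Rule:
    metric: str
    warn: float   # at or above this value: potential problem -> alarm
    error: float  # at or above this value: serious problem -> event


def load_rules(xml_text):
    """Parse the rule elements of the (hypothetical) agent configuration."""
    root = ET.fromstring(xml_text)
    return [
        Rule(r.get("metric"), float(r.get("warn")), float(r.get("error")))
        for r in root.findall("rule")
    ]


def evaluate(rules, readings):
    """Compare readings with the rules and return (kind, metric, value) tuples."""
    results = []
    for rule in rules:
        value = readings.get(rule.metric)
        if value is None:
            continue  # no reading for this metric, nothing to report
        if value >= rule.error:
            results.append(("event", rule.metric, value))
        elif value >= rule.warn:
            results.append(("alarm", rule.metric, value))
    return results


rules = load_rules(CONFIG)
# Made-up readings standing in for values an agent would collect on a node.
results = evaluate(rules, {"swap_used_pct": 87.0, "baseboard_temp_c": 78.0})
for kind, metric, value in results:
    print(kind, metric, value)
```

In this sketch the thresholds live in the configuration rather than in the code, so a rule can be tuned per farm without touching the agent itself.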

Files in Document:
None
Associated with Events:
CHEP2001 held on 03 Sep 2001 in Beijing, China