Watching the servers
How do you make sure all the nodes in your Linux super-cluster are humming along smoothly? The Energy Department's Sandia National Laboratories had to write special software to do the job. Now, the agency is offering the resulting program, called Ovis (GCN.com/731), under an open-source license.
Ovis oversees the performance and health of individual computers within Sandia's 8,960-node Thunderbird system, capable of performing 53 trillion floating-point operations per second. The software collects and correlates a number of environmental conditions, such as CPU temperatures, fan speeds, memory error rates, room temperatures and airflows.
The software differs from commercial computational platform monitoring and analysis products in that it takes a statistical approach to detecting abnormal performance, said Philippe Pebay, a member of the Sandia technical staff who helped develop the software. Commercial products, which tend to be built for specific equipment, watch for absolute thresholds, and those thresholds are often set high to prevent false positives, Pebay said. 'They were not able to address some of the problems we were facing.' Ovis instead compares individual nodes with one another to derive a statistical norm for all the nodes in a cluster, then looks for nodes that stray from it.
For instance, the software can tell when a CPU is running too cool as well as when it overheats. That information can be valuable because an unusually cool CPU can indicate that its cooling fan is running constantly and will burn out sooner than anticipated.
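The idea of comparing nodes with one another can be illustrated with a short Python sketch. It is not Ovis code, and the article does not say which statistics Ovis uses. The function name `flag_outliers`, the node names and the cutoff of two standard deviations are all assumptions made for illustration. The sketch takes one temperature reading per node, computes the cluster mean and standard deviation, and flags any node that sits too far from the mean in either direction.

```python
from statistics import mean, stdev


def flag_outliers(readings, z_threshold=2.0):
    """Flag nodes whose reading is far from the cluster norm.

    readings: dict mapping a node name to one numeric reading
              (for example, CPU temperature in degrees Celsius).
    z_threshold: hypothetical cutoff, in standard deviations.
    Returns a dict mapping each flagged node to "hot" or "cool".
    """
    values = list(readings.values())
    if len(values) < 2:
        return {}  # a norm needs at least two nodes

    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return {}  # every node reports the same value

    flagged = {}
    for node, value in readings.items():
        z = (value - mu) / sigma
        if z > z_threshold:
            flagged[node] = "hot"
        elif z < -z_threshold:
            flagged[node] = "cool"
    return flagged


# Illustrative data only: nine nodes near 50 C and one near 30 C.
cluster = {
    "n01": 50, "n02": 51, "n03": 49, "n04": 50, "n05": 52,
    "n06": 48, "n07": 50, "n08": 51, "n09": 49, "n10": 30,
}
print(flag_outliers(cluster))
```

A fixed alarm set at, say, 80 degrees would never notice the 30-degree node. Measuring each node against its peers catches it, which is the kind of too-cool reading described above.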
The software collects environmental information from the servers through a variety of means. In some cases, it uses the Intelligent Platform Management Interface, an industry hardware reporting specification. It also uses vendor-specific metric-collecting tools from companies such as Hewlett-Packard Co. and Linux Networx Inc.
In cases where no commercial metric-gathering tools are available, the team crafted scripts that collect data from the sensors and write the results into text files.
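A collection script of this kind might look something like the following Python sketch. The article does not describe the Sandia scripts, so everything here is an assumption: the sketch reads temperatures from the Linux hwmon sysfs files (`temp*_input`, which report millidegrees Celsius), and the function names, the tab-separated layout and the output path are hypothetical.

```python
import glob
import os
import socket
import time


def read_temperatures(sensor_root="/sys/class/hwmon"):
    """Return {sensor name: degrees Celsius} for readable hwmon sensors."""
    readings = {}
    pattern = os.path.join(sensor_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as sensor_file:
                millidegrees = int(sensor_file.read().strip())
        except (OSError, ValueError):
            continue  # skip sensors that cannot be read or parsed
        chip = os.path.basename(os.path.dirname(path))
        sensor = os.path.basename(path)
        readings[f"{chip}/{sensor}"] = millidegrees / 1000.0
    return readings


def append_readings(out_path, readings, host=None, timestamp=None):
    """Append one tab-separated line per sensor to a text file."""
    host = host or socket.gethostname()
    timestamp = int(time.time()) if timestamp is None else timestamp
    with open(out_path, "a") as out:
        for name, celsius in sorted(readings.items()):
            out.write(f"{timestamp}\t{host}\t{name}\t{celsius:.1f}\n")


if __name__ == "__main__":
    # Hypothetical output location; a real deployment would choose its own.
    append_readings("node_temperatures.txt", read_temperatures())
```

Run periodically on each node, a script like this would produce plain text files that a central monitor could gather and compare across the cluster.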
Pebay said lab personnel hope other federal agencies will use the software and offer feedback and improvements. The software was issued under the Berkeley Software Distribution open-source license.