Lawrence Livermore erects HPC test bed
Novel agreement shares supercomputer deployment costs between government and vendors.
One of the biggest challenges for managers of high-performance
computing systems is testing new code or hardware before it is
rolled out across thousands of nodes. Something may work fine on a
single server but collapse as it is extended across many more
nodes. Compounding the problem, supercomputer cycles are in such
demand these days that getting time on the big iron for testing
purposes is difficult.
One possible solution to this challenge is a new, modestly sized
supercomputer being put into place by the Energy Department's
Lawrence Livermore National Laboratory (LLNL). The system, dubbed
Hyperion, will have 1,152 nodes with 9,216 processor cores, which
should let the machine execute about 100 trillion floating-point
operations per second (100 teraflops).
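That peak figure is consistent with a rough back-of-the-envelope
estimate. The sketch below assumes roughly 10.8 billion
floating-point operations per second per core (an illustrative
per-core rate, not a number published by the lab):

    # Back-of-the-envelope peak-performance estimate for Hyperion.
    # Assumption: ~10.8 GFlops per core (e.g., ~2.7 GHz x 4 flops/cycle);
    # the per-core rate is illustrative, not an LLNL-published figure.
    cores = 9_216
    gflops_per_core = 2.7e9 * 4 / 1e9      # ~10.8 GFlops per core
    peak_tflops = cores * gflops_per_core / 1_000
    print(f"Estimated peak: {peak_tflops:.0f} TFlops")   # ~100 TFlops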
Although it's modest compared with the latest round of the Energy
Department's petascale machines, it is large enough to test how
well software and hardware will scale, said Mark Seager, LLNL
project leader.
"Scale is a big deal," he said.
To help cover the cost of building Hyperion, LLNL has brought
in vendors who have agreed to contribute equipment. In return, they
get a portion of the computing time.
Dell Inc., Intel Corp., Supermicro, QLogic, Cisco, Mellanox,
DataDirect Networks Inc., Sun Microsystems, LSI Storage Systems and
Red Hat all contributed to the new system.
This cluster will be "bigger than any one of us could build
alone," Seager said. "It will primarily be used at [testing]
hardware and software infrastructure."
"The engineering guys typically don't have budgets for
computers, and we want them to have access to this stuff so they
can test [their work] at scale," he continued. Seager noted that
too many times a piece of hardware or software won't scale when
it's introduced into a production environment, and the
operational folks will have to troubleshoot the system while it is
running production jobs. "Building this test bed will provide these
resources upfront," he said.
The approach of getting vendors to share the cost of building a
test bed is a novel one, at least in government. Such a system, if
acquired outright, would probably cost around $20 million, Seager
said.
The lab asked the vendors to discount the equipment and bought some
of it outright. As a result, LLNL ended up spending about $5.5
million, and the vendors contributed another $5 million.
Vendors tend to have marketing budgets that they are not sure how
to spend, so Seager argued that devoting some of this money to
building a test bed can help a company develop new technologies as
well as market its wares.
For their participation, vendors are guaranteed a certain amount
of time on the system. They own a small portion of the machine
outright, which can be used anytime. The design of the system
allows researchers to carve out a section of the system for testing
new hardware or software in such a way that if the test crashes,
the whole system won't go down. The vendors also get some time on
the entire machine, where they can run benchmarks and application
scalability runs. They can even bring in additional equipment and
plug it into the Infiniband backbone.
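Conceptually, the carve-out works like the sketch below, which
splits a node list into an isolated test slice and an untouched
production pool. The node names and the 64-node slice are
hypothetical, and on a real cluster this would be handled by the
resource manager rather than a standalone script:

    # Conceptual sketch of carving an isolated test partition out of a cluster.
    # Node names and partition size are hypothetical, not LLNL's configuration.
    def carve_partition(all_nodes, test_size):
        """Split the node list so a crashing test never touches production nodes."""
        test_nodes = all_nodes[:test_size]        # dedicated to the experiment
        production_nodes = all_nodes[test_size:]  # keeps running if the test fails
        return test_nodes, production_nodes

    nodes = [f"hyperion{n:04d}" for n in range(1, 1153)]  # 1,152 nodes in all
    test, prod = carve_partition(nodes, 64)               # e.g., a 64-node test slice
    print(len(test), len(prod))                           # 64 1088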
Seager expects to complete building Hyperion in March 2009; the
first half of the machine is already in place and running.
The system will allow the Energy Department laboratory to test
and compare new technologies, which should inform purchasing
decisions for larger machines. One early round of testing will
determine whether to use Infiniband or Ethernet transport for
storage area networks. Hyperion will have two storage area
networks, one based on Infiniband and one based on 10 Gigabit
Ethernet.
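An evaluation like that typically comes down to measured throughput
and latency on each fabric. The sketch below times a simple
sequential write against two hypothetical mount points, one on each
storage area network; a real comparison would rely on parallel I/O
benchmarks such as IOR rather than a single-client script like
this:

    # Illustrative write-throughput check against two storage paths, one assumed
    # to be mounted over Infiniband and one over 10 Gigabit Ethernet. The mount
    # points are hypothetical; real tests would use a parallel benchmark like IOR.
    import os
    import time

    def write_throughput_mb_s(path, size_mb=1024, block=1 << 20):
        buf = os.urandom(block)                 # 1 MB of data per write
        start = time.time()
        with open(path, "wb") as f:
            for _ in range(size_mb):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())                # make sure data reaches the SAN
        return size_mb / (time.time() - start)

    for testfile in ("/mnt/san-ib/testfile", "/mnt/san-10gbe/testfile"):
        print(testfile, f"{write_throughput_mb_s(testfile):.0f} MB/s")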
The lab will also use the system to test and refine the Lustre file
system, an open-source Infiniband software stack, a cluster version
of Red Hat Enterprise Linux, as well as various other cluster
software used by Energy Department labs.
For both the labs and the participating vendors, Hyperion will
make it "easier to test and develop [new technologies] without
buying more infrastructure," said Omar Sultan, a Cisco senior
solution manager.
Cisco provided its Nexus 5000 and Nexus 7000 series switches to
the project. Sultan said he was unaware of any projects that Cisco
itself would carry out using Hyperion. He did note that the use
of Nexus switches within an HPC system will serve as a marketing
tool for the company, allowing Cisco to demonstrate the
switches' potential usefulness for other HPC systems. The
switches, which direct traffic across the nodes, can scale from 10
Gigabit Ethernet to the experimental 40-Gigabit and 100-Gigabit Ethernet by swapping out a line card.
In addition to testing infrastructure products, the system can
also test how well software programs will operate on large
machines. Researchers at Stanford University are working on modeling a
hypersonic aircraft, so they can better determine why scramjet
engines can suddenly stop running. The three-dimensional modeling
code for representing the engine can scale up to 4,072 cores. They
were able to do a test run on Hyperion to ensure the code will be ready
when they get a time slot on one of the production machines that
operate on behalf of the Energy Department's Accelerated Strategic
Computing Initiative.