Supercomputer-powered training puts machine learning on the fast track
New software from researchers at Oak Ridge National Laboratory speeds up the design and training of neural networks.
Neural networks offer the promise of making sense of massive datasets, but only if they can be trained to know precisely what to look for in order to produce valid results.
Creating such networks is labor intensive and can take months. It involves combining the right types of layers -- convolution layers, pooling layers, fully connected layers -- in the right numbers so that the finished network can accurately classify what's in the data it is intended to analyze.
“There is a lot of design process to architecting what the network looks like,” said Travis Johnston, a postdoctoral researcher at Oak Ridge National Laboratory.
Johnston and his ORNL colleague Steven Young have been tapping into the lab's Titan supercomputer to automate the design of neural networks. They have now developed two pieces of code that -- when running on the lab's supercomputer -- dramatically speed up the process.
The Multi-node Evolutionary Neural Networks for Deep Learning (MENNDL) code has been under development for a couple of years. It is a genetic algorithm that selects the best-performing networks to create the next generation, iterating until the best solution has evolved. The other code, RAvENNA, is for “more fine-grained tinkering,” Johnston told GCN.
MENNDL builds the network from the ground up, making decisions on the number of layers and what each layer will do. It starts by randomly guessing how to assemble the networks and then tests them against the datasets they are being built to analyze.
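In spirit, that evolutionary loop resembles the short Python sketch below. It is an illustration only, not ORNL's code: the layer vocabulary, mutation rule and placeholder fitness function are assumptions, and in the real system the fitness step is the expensive part, training each candidate network on GPU nodes.

```python
# Minimal sketch of evolutionary architecture search in the spirit of MENNDL.
# Everything here (layer choices, mutation rule, fitness proxy) is illustrative.
import random

LAYER_CHOICES = ["conv3x3", "conv5x5", "maxpool", "dense"]

def random_architecture(max_layers=8):
    """Randomly guess a network layout: a list of layer types."""
    depth = random.randint(2, max_layers)
    return [random.choice(LAYER_CHOICES) for _ in range(depth)]

def fitness(arch):
    """Placeholder for 'train the network and measure validation accuracy'."""
    score = sum(1.0 for a, b in zip(arch, arch[1:]) if a != b)  # toy proxy
    return score + random.random() * 0.1

def mutate(arch):
    """Perturb one layer choice to create a child architecture."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(LAYER_CHOICES)
    return child

def evolve(generations=20, population_size=16, elite=4):
    population = [random_architecture() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:elite]                    # keep the best performers
        children = [mutate(random.choice(parents))  # breed the next generation
                    for _ in range(population_size - elite)]
        population = parents + children
    return max(population, key=fitness)

if __name__ == "__main__":
    print(evolve())
```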
RAvENNA takes these macroscale network suggestions and provides more micro-level adjustments, like the number of neurons in a layer.
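That fine-grained pass can be pictured as a simple hill climb over per-layer settings. The sketch below is hypothetical: score() stands in for training and validating the network, and the target widths are made up for illustration.

```python
# Rough sketch of fine-grained refinement: keep the layer layout fixed and
# nudge per-layer hyperparameters (here, neuron counts), keeping improvements.
import random

def score(neuron_counts):
    """Placeholder for validation accuracy after training with these widths."""
    target = [128, 64, 32]  # pretend optimum, purely for illustration
    return -sum(abs(n - t) for n, t in zip(neuron_counts, target))

def refine(neuron_counts, steps=200, step_size=8):
    best, best_score = list(neuron_counts), score(neuron_counts)
    for _ in range(steps):
        candidate = list(best)
        i = random.randrange(len(candidate))
        candidate[i] = max(1, candidate[i] + random.choice([-step_size, step_size]))
        if score(candidate) > best_score:  # keep only changes that help
            best, best_score = candidate, score(candidate)
    return best

if __name__ == "__main__":
    # e.g. widths suggested by the macro-level search
    print(refine([256, 256, 256]))
```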
Both tools can generate and train as many as 18,600 neural networks simultaneously and have achieved a peak performance of 20 petaflops on Titan, ORNL officials said. In practical terms, that translates to training 40,000 to 50,000 networks per hour.
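That throughput comes from fanning candidate networks out across thousands of GPU-equipped nodes. Greatly simplified, and using a local process pool in place of Titan's thousands of ranks, the pattern looks roughly like this; evaluate() is a dummy for the expensive training step.

```python
# Sketch of the fan-out pattern: evaluate many candidate networks in parallel.
# A process pool stands in for a supercomputer's worker nodes.
from concurrent.futures import ProcessPoolExecutor
import random

def evaluate(candidate_id):
    """Stand-in for training one candidate network and returning its accuracy."""
    return candidate_id, random.random()

if __name__ == "__main__":
    candidates = range(64)  # in production: tens of thousands per hour
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(evaluate, candidates))
    best = max(results, key=lambda r: r[1])
    print("best candidate:", best)
```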
Researchers at Fermilab had been working for three months on creating a neural network for their research observing neutrinos going through detectors. ORNL used its two new codebases along with 4,000 nodes of Titan to create a better network in 24 hours, Johnston said.
“They had a pretty decent network,” he said. “But MENNDL was able to come up with one from scratch that dramatically outperformed what they had done.”
This means scientists can spend more time on research than on building neural networks. Running MENNDL and RAvENNA on a supercomputer dramatically reduces the time to solution, Young said.
The process is limited by the number of networks that can be evaluated and trained in a given period of time, Johnston said. When ORNL brings its Summit supercomputer online in the coming months, efficiency will increase even more.
Summit will have more and better GPUs, which will allow the tools to test more and larger networks. "Out of the box, without tuning to Summit's unique architecture, we are expecting an increase in performance up to 50 times," Johnston said.
The researchers also plan to combine MENNDL and RAvENNA into a single piece of software, so RAvENNA can directly refine the results from MENNDL. They also want to change the backend they use to train the networks to allow for more flexibility in what the networks look like, Johnston said.
Researchers outside ORNL have shown interest in the applications, Johnston said, so they’re looking at software licensing and open source possibilities.