IRS beefs up compute power for fraud detection
When IRS ran a test of its fraud-detection algorithm on a GPU-accelerated platform across 4 terabytes of data, processing speeds increased 10-fold.
The IRS’ Office of Research, Applied Analytics and Statistics (RAAS) increased processing speeds 10-fold during a pilot test of a GPU-acceleration platform.
For the test, IRS built a cluster for its fraud-detection algorithm and ran a dataset of about 4 terabytes against it.
“Our expectation was we were definitely going to see some speed-up in computational processing,” but not that much, Joe Ansaldi, RAAS chief technical branch leader, said during an Aug. 5 online event discussing the GPU-accelerated Cloudera Data Platform (CDP) on NVIDIA-certified systems.
RAAS has struggled with having the infrastructure to support data mining on its “troves of data,” he added. “Our biggest challenge is definitely the infrastructure to support all the ideas that the subject-matter experts are coming up with,” he said. The SMEs want to “dive in deeper within the algorithm space,” creating and training algorithms and expanding the number of parameters each of those algorithms has, he said.
Cloudera and NVIDIA aim to address key challenges around accelerated data science and artificial intelligence with the CDP powered by NVIDIA computing.
According to Scott McClellan, senior director of NVIDIA’s data science product group, data science starts with a preproduction, or ideation, phase in which researchers iteratively work through data engineering challenges such as aggregation and cleaning to produce labeled data they can use to train machine learning models or feed analytics pipelines.
The faster this phase moves, the more productive data scientists can be in the production phase, when pipelines are running continuously at scale. At that point, researchers may be dealing with individual datasets that are hundreds of gigabytes or terabytes in size and produced on an hourly basis, leading to total data processing in the hundreds of terabytes to petabytes range, McClellan said at an Aug. 3 online event.
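The data engineering work McClellan describes, aggregating and cleaning raw records into labeled data for model training, can be sketched in a few lines. This is a hypothetical illustration in pandas; the article does not specify the tooling or schema the IRS team used, and the column names here are invented:

```python
import pandas as pd

# Hypothetical raw transaction records standing in for real source data.
raw = pd.DataFrame({
    "account": ["A", "A", "B", "B", "C"],
    "amount": [120.0, None, 5000.0, 4800.0, 75.0],
    "flagged": [0, 0, 1, 1, 0],
})

# Cleaning step: drop records with missing amounts.
clean = raw.dropna(subset=["amount"])

# Aggregation step: roll up per-account features, carrying a label
# ("any_flagged") that a fraud model could be trained against.
features = clean.groupby("account").agg(
    total_amount=("amount", "sum"),
    any_flagged=("flagged", "max"),
).reset_index()

print(features)
```

Iterating quickly through steps like these, on real data measured in terabytes, is exactly the phase GPU acceleration is meant to shorten.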
CDP is a hybrid multicloud platform that can run in a data center, in the cloud or across both. It has a Shared Data Experience layer that handles security, governance, lineage, migration and metadata. Running on that is the Cloudera Accelerated Runtime with NVIDIA RAPIDS, a suite of open source software libraries and application programming interfaces for executing data science pipelines entirely on GPUs. On top of that sits user-facing data services.
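RAPIDS achieves the "no application changes" property by mirroring familiar CPU dataframe APIs: cuDF, the RAPIDS dataframe library, exposes a pandas-like interface, so the same code can run on either backend. A minimal sketch, with a CPU fallback that is our own addition for machines without a GPU or a RAPIDS install (it is not part of CDP):

```python
# cuDF (RAPIDS) mirrors the pandas API, so this dataframe code is
# identical whether it executes on a GPU or a CPU.
try:
    import cudf as xdf  # GPU-accelerated dataframes, if RAPIDS is present
except ImportError:
    import pandas as xdf  # CPU fallback (our assumption for this sketch)

df = xdf.DataFrame({"amount": [10.0, 250.0, 9000.0],
                    "risk": [0, 0, 1]})

# Same filter-then-summarize call on either backend.
high = df[df["amount"] > 100.0]
print(float(high["amount"].mean()))  # → 4625.0
```

API compatibility of this kind is what lets existing pipelines pick up GPU acceleration without rewrites, the point Thomas emphasizes below.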
With the services, users can manage the data life cycle: stream data, store and enrich it, report on it, build operational applications and build ML and AI models, said Sushil Thomas, vice president of ML at Cloudera.
“Cloudera integrates this functionality into CDP so that all of our customers and all of that data have access to that acceleration without making any changes to their application, and that’s really important,” Thomas said. “Customers don’t have to go and rewrite things; their existing applications work.”
“It lets our customers do more. On the ML side, this means more compute power for model creation and better model accuracy,” he added. “On the data engineering side, it means accelerated processing with five times or more full-stack acceleration when you think about an end-to-end data science workload. This means you can do five times more with the same data center footprint.”
At RAAS, Ansaldi said the subject-matter expert who worked on the fraud-detection algorithm will continue running that and that the department plans to procure NVIDIA A100 Tensor Core GPU cards, which accelerate data center compute abilities, to expand testing.
“With the pilot now, we have an idea of the infrastructure we need going forward and also just in terms of thinking of the algorithms and how we can approach these problems to actually code out the solutions to them,” he said. “Now the shackles are off and we can just run to our heart’s desire.”