Approximate matching can help find needles in haystacks
NIST is preparing a publication on approximate matching, a technique that uses similarity-detecting functions to help analysts spot malicious code in files.
Finding malicious code is not too difficult if you have a fingerprint or signature to look for. Traditional signature-based antivirus tools have been doing this effectively for years. But malware often morphs, adapts and evolves to hide itself, and a simple one-to-one match is no longer adequate.
The National Institute of Standards and Technology is developing guidance for a technique called approximate matching to help automate the task of identifying suspicious code that otherwise would fall to human analysts. The draft document is based on the work of NIST’s Approximate Matching Working Group.
“Approximate matching is a promising technology designed to identify similarities between two digital artifacts,” the draft of Special Publication 800-168 says. “It is used to find objects that resemble each other or to find objects that are contained in another object.”
The technology can be used to filter data for security monitoring and for digital forensics, when analysts are trying to spot potential bad actors either before or after a security incident.
Approximate matching is a generic term describing any method for automating the search for similarities between two digital artifacts or objects. An “object” is an “arbitrary byte sequence, such as a file, which has some meaningful interpretation.”
Humans can understand the concept of similarity intuitively, but defining the aspects of similarity for algorithms can be challenging. In approximate matching, similarity is defined for algorithms in terms of the characteristics of artifacts being examined. These characteristics can include byte sequences, internal syntactic structures or more abstract semantic attributes similar to what human analysts would look for.
Different methods for approximate matching operate at different levels of abstraction. These range from generic techniques at the lowest level, which detect common byte sequences, to more abstract analyses that approach the level of human evaluation. “The overall expectation is that lower level methods would be faster, and more generic in their applicability, whereas higher level ones would be more targeted and require more processing,” the document explains.
Approximate matching uses two types of queries: resemblance and containment. Two successive versions of a piece of code are likely to resemble each other, and a resemblance query simply identifies two pieces of code that are substantially similar. With a containment query, two objects of substantially different size, such as a file and a whole-disk image, are examined to determine whether the smaller object, or something similar to it, is contained in the larger one.
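To make the two query types concrete, here is a minimal byte-level sketch in Python. It is not taken from the NIST draft: the function names, the choice of 4-byte n-grams as features and the sample data are all assumptions made for illustration, and real tools use far more compact digests.

```python
from __future__ import annotations


def ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """Return the set of all n-byte substrings of data (hypothetical feature choice)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def resemblance(a: bytes, b: bytes, n: int = 4) -> float:
    """Jaccard index of the two feature sets: shared features / all features."""
    fa, fb = ngrams(a, n), ngrams(b, n)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


def containment(small: bytes, large: bytes, n: int = 4) -> float:
    """Fraction of the smaller object's features that also occur in the larger one."""
    fs, fl = ngrams(small, n), ngrams(large, n)
    return len(fs & fl) / len(fs) if fs else 0.0


# Two successive versions of a snippet, and a larger blob that embeds the first one.
v1 = b"def check(user): return user.is_admin and user.active"
v2 = b"def check(user): return user.is_admin or user.active"
image = b"\x00" * 64 + v1 + b"\xff" * 64

print(resemblance(v1, v2))     # high, but below 1.0: the versions differ slightly
print(containment(v1, image))  # every feature of v1 is found inside the blob
print(resemblance(v1, image))  # lower: the blob holds much more than v1
```

The last two lines show why the two queries are kept separate: an object can be fully contained in a much larger one while resembling it only weakly as a whole.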
As described in the document, approximate matching is usually used to filter data, as in blacklisting known malicious artifacts or anything closely resembling them. “However, approximate matching is not nearly as useful when it comes to whitelisting artifacts, as malicious content can often be quite similar to benign content,” NIST warns.
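A blacklist filter of this kind might look roughly like the following Python sketch. Everything here is an assumption for illustration: the `flag_suspicious` helper, the 0.8 threshold and the sample strings are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for a purpose-built approximate matching function.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def similarity(a: bytes, b: bytes) -> float:
    """Score two byte strings between 0.0 and 1.0 (stand-in similarity function)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def flag_suspicious(candidates: dict[str, bytes],
                    known_bad: list[bytes],
                    threshold: float = 0.8) -> list[str]:
    """Return names of candidates that closely resemble any blacklisted artifact."""
    flagged = []
    for name, data in candidates.items():
        if any(similarity(data, bad) >= threshold for bad in known_bad):
            flagged.append(name)
    return flagged


# Hypothetical blacklist entry and two files to screen.
known_bad = [b"powershell -enc AAAA; download payload; execute payload"]
candidates = {
    "variant": b"powershell -enc BBBB; download payload; execute payload",
    "benign": b"quarterly budget report for the finance committee",
}

flagged = flag_suspicious(candidates, known_bad)
print(flagged)
```

An exact hash comparison would miss the slightly altered variant, while the similarity threshold catches it. The same mechanism is a poor fit for whitelisting, for the reason NIST gives: a malicious file can score as very similar to a benign one.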
The publication lays out essential requirements of approximate matching functions as well as the factors that determine the reliability of the results, including sensitivity and robustness, precision and recall, and security.
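Of those factors, precision and recall are the easiest to show with numbers. The Python sketch below uses the standard definitions; the helper name and the sample file sets are made up for illustration and do not come from the draft.

```python
from __future__ import annotations


def precision_recall(flagged: set[str], truly_malicious: set[str]) -> tuple[float, float]:
    """Precision: share of flagged items that are really malicious.
    Recall: share of really malicious items that were flagged."""
    true_positives = len(flagged & truly_malicious)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_malicious) if truly_malicious else 0.0
    return precision, recall


# Hypothetical outcome of one filtering run.
flagged = {"a.exe", "b.dll", "c.doc", "d.pdf"}
truly_malicious = {"a.exe", "b.dll", "e.js"}

precision, recall = precision_recall(flagged, truly_malicious)
print(precision, recall)
```

Here two of the four flagged files are actually malicious, and two of the three malicious files were caught. Loosening a matching threshold generally raises recall at the cost of precision, which is one reason the reliability of a result depends on how the function is tuned.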
Comments on the publication should be sent by March 21 to match@nist.gov with “Comments on SP 800-168” in the subject line.