From similarity to attribution: Machine learning-driven malware analysis
By Alex Franko, Markus Tuominen, Mohammad Kazem Hassan Nejad, Dmitriy Komashinskiy
Introduction
With the rapid emergence of new malware variants, accurately classifying and attributing malware samples has become more challenging than ever. To address this, WithSecure developed a machine learning model that classifies Windows binaries and identifies connections between similar samples. This model evaluates whether a submitted file is likely to be clean or malicious. It also outputs five similar samples it recognizes, helping analysts find connections between samples efficiently.
The model's similarity feature was integrated with OpenCTI, an open-source threat intelligence platform. Now, analysts can gain deeper insights into each analyzed sample by investigating the related samples provided by the model. This similarity matching improves the ability to classify and attribute malware, providing clearer insights into the origins and relationships of each sample.
Model overview
WithSecure leverages machine learning to detect cyber threats. Through decades of work, WithSecure has developed infrastructure, data collectors, and analysis tools that enable AI-driven threat detection. One of these tools is a model that analyzes static features of Windows Portable Executable (PE) files by converting them into numerical arrays (also referred to as vectors). Executables that produce similar vectors are likely to receive similar verdicts, so the vectors can be used to build a search index that finds connections between new samples and known malicious files.
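To make the idea of a vector search index concrete, here is a minimal sketch of a nearest-neighbour lookup. It is not WithSecure's implementation: the Euclidean metric, the brute-force scan, the function names, and the three-dimensional toy vectors are all assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query, index, k=5):
    """Return the k (distance, sample_id) pairs closest to the query.

    `index` maps a sample identifier to its feature vector.
    """
    scored = [(euclidean(query, vec), sample_id) for sample_id, vec in index.items()]
    scored.sort()  # shortest distance first
    return scored[:k]


# Hypothetical toy vectors; identifiers and values are made up.
index = {
    "known_a": [0.9, 0.1, 0.0],
    "known_b": [0.8, 0.2, 0.1],
    "known_c": [0.0, 0.9, 0.7],
}
neighbours = most_similar([0.9, 0.1, 0.05], index, k=2)
nearest_ids = [sample_id for _, sample_id in neighbours]
```

A brute-force scan like this is only practical for small collections; a production index would more plausibly rely on an approximate nearest-neighbour structure, but the article does not say which one is used.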
OpenCTI integration
The machine learning model was integrated into OpenCTI as a “connector”. Connectors are additional components for OpenCTI whose job is to bring in data from external sources. The connector works by enriching file observables with links to similar samples that already exist on the platform. The platform provides additional context for those similar samples, which can be used for further pivoting. In some cases there is little information linked directly to the sample at hand. By pivoting on similar samples, however, an analyst may discover information that was not available for the original sample.
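The enrichment step can be pictured as turning each similar sample into a link between two file observables. The sketch below builds such links as plain dictionaries shaped like STIX 2.1 relationship objects. It does not use the OpenCTI client library and is not the actual connector code: the function name, the `related-to` relationship type, the `description` text, the optional distance threshold, and the identifiers are all assumptions.

```python
import uuid
from datetime import datetime, timezone


def build_similarity_links(observable_id, similar, max_distance=None):
    """Build relationship dicts from one observable to its similar samples.

    `similar` is a list of (target_observable_id, distance) pairs.
    Pairs farther away than `max_distance` are skipped when it is set.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    links = []
    for target_id, distance in similar:
        if max_distance is not None and distance > max_distance:
            continue
        links.append({
            "type": "relationship",
            "spec_version": "2.1",
            "id": f"relationship--{uuid.uuid4()}",
            "created": now,
            "modified": now,
            "relationship_type": "related-to",  # assumed relationship type
            "source_ref": observable_id,
            "target_ref": target_id,
            "description": f"Similar sample (distance {distance:.3f})",
        })
    return links


# Hypothetical observable identifiers, for illustration only.
submitted = "file--00000000-0000-4000-8000-000000000001"
close_match = "file--00000000-0000-4000-8000-000000000002"
far_match = "file--00000000-0000-4000-8000-000000000003"

links = build_similarity_links(
    submitted, [(close_match, 0.12), (far_match, 0.95)], max_distance=0.5
)
```

Keeping the distance in the link description lets an analyst judge how strong each suggested connection is before pivoting on it.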
Figure A. A file observable on the OpenCTI platform which has been enriched with the malware similarity connector. The most similar sample is highlighted in red.
Figure B. The page of the most similar file sample in OpenCTI. It shows that the sample was related to an Xworm RAT intrusion incident.
Real-world example
Lockbit
One of the main objectives when analyzing an unknown malware sample is to determine whether it belongs to a known malware family. In the example shown in Figure C, an unknown sample was submitted to the model, which returned 5 similar samples at relatively short distances. When those 5 samples were looked up in various sources (one example is shown in Figure D), all of them were identified as Lockbit 3 (also known as Lockbit Black). It could therefore be deduced that the submitted sample is a Lockbit 3 variant as well.
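The reasoning in this example, where every similar sample carries the same family label and the unknown sample is attributed to that family, can be expressed as a simple agreement check. This is a hedged sketch of that reasoning only: the function name, the `min_agreement` threshold, and the use of `None` for unlabelled neighbours are assumptions, and the labels would in practice come from an analyst's lookups rather than from the model.

```python
from collections import Counter


def infer_family(neighbour_labels, min_agreement=1.0):
    """Return the family shared by the neighbours, or None if agreement is too low.

    `neighbour_labels` holds one family name per similar sample,
    with None for samples that could not be attributed.
    """
    labelled = [label for label in neighbour_labels if label is not None]
    if not labelled:
        return None
    family, count = Counter(labelled).most_common(1)[0]
    # Unlabelled neighbours count against agreement, which keeps the check conservative.
    if count / len(neighbour_labels) >= min_agreement:
        return family
    return None


# All five similar samples agree, as in the example above.
unanimous = infer_family(["Lockbit 3"] * 5)

# A mixed result does not meet the default requirement of full agreement.
mixed_labels = ["Lockbit 3", "Lockbit 3", "Xworm", None, "Lockbit 3"]
mixed_strict = infer_family(mixed_labels)
mixed_relaxed = infer_family(mixed_labels, min_agreement=0.5)
```

Requiring full agreement by default mirrors the case described here. A lower threshold trades certainty for coverage, and any result should be treated as a lead to verify rather than a final attribution.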
Figure C. Model output (including 5 similar samples) for the submitted unknown sample
Figure D. VirusTotal attribution on similar samples
Takeaways
The machine learning model represents an advancement in malware analysis by combining automated classification with similarity detection. Through the OpenCTI integration, analysts and cyber incident investigators can quickly identify new malware variants and understand their relationships to known threats and associated threat actors. As demonstrated by the Lockbit example, this approach can accelerate malware family identification and lets investigators pivot through similar samples. In an environment where threat actors constantly evolve their tactics, tools like this model enhance analysts’ capabilities and are becoming increasingly valuable for effective cyber defense.