In the first blog post of this series, we tested several tools for evading a static machine learning-based malware detection model. As promised, we are now taking a closer look at the EMBER dataset and feature engineering techniques for creating a detection model.
This blog series is based on my bachelor thesis, which I wrote in summer 2020 at ETH Zurich. In the first post, we saw how machine learning malware detection models work and how to evade them. We set up a test environment by training a model on the EMBER dataset and querying it with the Jigsaw ransomware. Our goal was to reduce the prediction score of our model to a value below 0.5 by modifying our malware with commonly used evasion techniques. (If you have read Part 1, you already know that the output of a trained malware detection model is a probability value between 0 and 1, with values closer to 0 meaning benign and values closer to 1 meaning malicious.) By carefully analyzing the results of our model, we were able to achieve this goal with the help of some programming experience and a runtime packer. In the end, our “best” version was even able to reduce the score on VirusTotal from 64 to 14.
In Part 2 we will learn which features are used and how they are fed into a malware detection model to classify an input file. After a short theory part, we try to find out which features are especially important for malware analysis and how to modify them. Finally, we will change some of the features of our ransomware to evade our model. But before all this, it is advisable to get familiar with the file format used by our malware.
PE File Format
Since our Jigsaw ransomware (like most malware) targets Windows operating systems, it’s useful to take a look at the Portable Executable (PE) format. Similar to the ELF format used on Linux and most other Unix versions, 32-bit and 64-bit versions of Windows operating systems use the PE format for executable files. The PE format collects the information necessary for the Windows operating system to manage the packaged executable. Files with extensions such as .exe and .dll are famous representatives of this file format.
Most of the headers shown in the image above are used by the dynamic linker and specify how the PE file should be mapped to memory at runtime. In the next section, we will see how these headers are used by machine learning models for detecting malware.
Features and Models
Generally speaking, a feature is a measurable property of an object. In the case of malware classification, for example, file size and timestamp are important features that can be used by a machine learning model. Features mostly appear in the form of numbers, but letters or even whole vectors can represent features. In order to feed all important features into a machine learning model, they are put into a single vector, also called a feature vector. But how do you get from a feature vector to a result like malicious or benign?
For this purpose, machine learning models are used, which analyze the feature vector and draw conclusions based on the information obtained. There are different types of models, such as neural networks or decision trees. Our malware detection model is built with LightGBM, which uses decision trees (more precisely, an ensemble of gradient-boosted decision trees) as its predictive model to go from the input file to its result.
As in the figure above, the features (e.g. gender, age, etc.) of a feature vector are examined and analyzed in a decision tree model to obtain a result. The bigger the tree, the more likely it is that a feature will be queried several times. The path from the root to the result node is called the decision path.
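To make the idea of a decision path more concrete, here is a minimal sketch of a single tree walking over a feature vector. The tree, the feature names, and the thresholds are invented for illustration and are not taken from the EMBER model.

```python
# Toy decision tree: internal nodes test one feature against a threshold,
# leaves carry a score. All features and thresholds are hypothetical.
TREE = {
    "feature": "timestamp", "threshold": 1_000_000_000,
    "left": {"leaf": 0.1},
    "right": {
        "feature": "file_size", "threshold": 500_000,
        "left": {"leaf": 0.8},
        "right": {"leaf": 0.3},
    },
}


def predict_with_path(tree, features):
    """Return the leaf score and the features queried on the decision path."""
    path = []
    node = tree
    while "leaf" not in node:
        name = node["feature"]
        path.append(name)
        # Go left if the feature value is at or below the threshold.
        if features[name] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"], path


score, path = predict_with_path(
    TREE, {"timestamp": 1_500_000_000, "file_size": 120_000}
)
```

A boosted model such as ours would evaluate many of these trees and combine their leaf values, but each individual tree is traversed in the same way.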
Now that we know what a feature is and how decision trees work, it is time to take a closer look at the features used by our model.
Feature engineering
The following section gives a high-level overview of the features that are contained in our feature vector. In EMBER, the LIEF project is used to parse an input file and extract features from it. This process generates a feature vector of length 2381 that can be fed into the model. Since we cannot describe all those features individually, we categorize them into nine groups of similar features, each of which is briefly described below.
Byte Histogram
The first group consists of a histogram with 256 values that counts how often each byte value (0 to 255) occurs in the file. The counts are normalized by the file size.
#Feature in Feature Vector | Feature name | Description |
1-256 | Byte histogram | Byte histogram of the PE file |
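The byte histogram can be sketched in a few lines. This is a simplified illustration using only the standard library, not the EMBER extraction code; the function name byte_histogram is our own.

```python
from collections import Counter
from typing import List


def byte_histogram(data: bytes) -> List[float]:
    """Return 256 values: the share of each byte value (0..255) in data."""
    total = len(data)
    if total == 0:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    return [counts.get(value, 0) / total for value in range(256)]


# Six bytes, three of which are zero bytes.
hist = byte_histogram(b"MZ\x90\x00\x00\x00")
```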
Byte-Entropy Histogram
The second group is a histogram that approximates how entropy and byte values are jointly distributed over the whole input file.
#Feature in Feature Vector | Feature name | Description |
257-512 | Byte entropy histogram | Byte-entropy histogram of the PE file |
This part is for math lovers and can be skipped as it is not necessary for understanding this blog. It describes the process of creating the byte entropy histogram:
This histogram simulates the joint distribution of entropy and byte value over the whole input file. This can be achieved by pairing the computed scalar entropy for a fixed-length window with each byte occurrence in the corresponding window. This process is repeated while the window passes through all input bytes. The EMBER implementation uses a window size of 2048 bytes and a step size of 1024 bytes. In the last step, the entropy and byte value are quantized into 16×16 bins. These 16×16 values are concatenated to a vector of length 256.
H. Anderson and P. Roth, “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models,” ArXiv e-prints, April 2018
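The description above can be turned into a short sketch. It follows the text (window of 2048 bytes, step of 1024 bytes, 16×16 bins) but is a plain-Python approximation under our own assumptions: the entropy is computed on full byte values, and a file shorter than one window is treated as a single block. It is not guaranteed to reproduce the values of the EMBER implementation.

```python
import math
from collections import Counter

WINDOW = 2048  # window size from the text
STEP = 1024    # step size from the text


def window_entropy(block: bytes) -> float:
    """Shannon entropy of a block in bits per byte (0.0 to 8.0)."""
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())


def byte_entropy_histogram(data: bytes):
    """Return a normalized 16x16 histogram, flattened to 256 values."""
    hist = [[0] * 16 for _ in range(16)]  # rows: entropy bin, cols: byte bin
    last_start = max(len(data) - WINDOW, 0)
    for start in range(0, last_start + 1, STEP):
        block = data[start:start + WINDOW]
        if not block:
            continue
        # Entropy ranges over 0..8 bits, so multiplying by 2 gives 16 bins.
        h_bin = min(int(window_entropy(block) * 2), 15)
        for value in block:
            hist[h_bin][value >> 4] += 1  # upper 4 bits give 16 byte bins
    flat = [v for row in hist for v in row]
    total = sum(flat)
    return [v / total for v in flat] if total else [0.0] * 256


low = byte_entropy_histogram(bytes(2048))              # all zero bytes
high = byte_entropy_histogram(bytes(range(256)) * 8)   # uniform byte values
```

The constant block ends up entirely in the lowest entropy bin, while the uniform block lands in the highest entropy row, spread evenly over the 16 byte-value bins.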
String information
The next group contains simple statistics about strings that consist of at least five printable characters, including a histogram of the printable characters. For example, the number of strings that begin with C:\ (file system path) or the number of occurrences of HKEY_ (registry key) belong to this group.
#Feature in Feature Vector | Feature name | Description |
513-616 | String information features | Information about the used strings |
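A few of these string statistics can be sketched with a regular expression. The dictionary keys and the exact counting rules below are our own simplification for illustration; the real feature group contains more values than shown here.

```python
import re

# Runs of at least five printable ASCII characters (0x20 to 0x7e).
PRINTABLE = re.compile(rb"[\x20-\x7e]{5,}")


def string_stats(data: bytes) -> dict:
    """Collect a handful of simple statistics about printable strings."""
    strings = PRINTABLE.findall(data)
    return {
        "num_strings": len(strings),
        "avg_length": sum(map(len, strings)) / len(strings) if strings else 0.0,
        # Strings that start with a C:\ path (case-insensitive).
        "paths": sum(1 for s in strings if s[:3].lower() == b"c:\\"),
        # Occurrences of the registry key prefix anywhere in the data.
        "registry": data.count(b"HKEY_"),
    }


sample = b"\x00\x01C:\\Windows\\system32\x00abc\x00HKEY_LOCAL_MACHINE\\Software\x00"
stats = string_stats(sample)
```

The short string abc is ignored because it has fewer than five printable characters.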
General file information
This group represents basic information about the executable (e.g. file size and the number of symbols).
#Feature in Feature Vector | Feature name | Description |
617-626 | General file information features | General information about the input file |
Header information
Information about the COFF header and the Optional Header belongs to this group (e.g. the timestamp).
#Feature in Feature Vector | Feature name | Description |
627-688 | Header information features | Information about COFF and the Optional Header |
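To show where two of these header values live in the file, the following sketch reads the timestamp and the characteristics field directly from the COFF header with the standard library. EMBER uses LIEF for this step; the function below is only a hand-rolled illustration, and the synthetic header at the end is made up so that the example runs without a real executable.

```python
import struct

IMAGE_FILE_DLL = 0x2000  # bit in the Characteristics field that marks a DLL


def read_coff_fields(data: bytes) -> dict:
    """Read the timestamp and characteristics from the COFF header."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The DOS header stores the offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _n_symbols,
     _opt_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "sections": n_sections,
        "timestamp": timestamp,
        "is_dll": bool(characteristics & IMAGE_FILE_DLL),
    }


# Synthetic header, only for demonstration purposes.
fake = bytearray(0x80 + 24)
fake[:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x80)
fake[0x80:0x84] = b"PE\x00\x00"
struct.pack_into("<HHIIIHH", fake, 0x84, 0x14C, 3, 1589932800, 0, 0, 0xE0, 0x2102)
fields = read_coff_fields(bytes(fake))
```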
Section Information
Properties like entropy, virtual size, and the entry name of each section are in this group.
#Feature in Feature Vector | Feature name | Description |
689-943 | Section information features | Properties of the used sections |
Imported functions
This group contains a set of unique libraries hashed to 256 bins, and a set of individual imported functions hashed to 1024 bins.
#Feature in Feature Vector | Feature name | Description |
944-2223 | Imported function features | Information about imported functions and the imported libraries |
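Hashing a variable number of names into a fixed number of bins is known as the hashing trick. The sketch below illustrates the idea with SHA-256 from the standard library so that it runs without extra dependencies. This is a swapped-in hash function, so the bin indices will not match those of the EMBER feature vector, and the example imports are invented.

```python
import hashlib


def hash_to_bins(tokens, n_bins):
    """Map each token to one of n_bins buckets and count the hits."""
    vec = [0] * n_bins
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "little") % n_bins
        vec[index] += 1
    return vec


# Hypothetical import table: library name -> imported functions.
imports = {
    "kernel32.dll": ["CreateFileW", "WriteFile"],
    "advapi32.dll": ["RegOpenKeyExW"],
}

libraries = sorted({lib.lower() for lib in imports})
functions = [f"{lib.lower()}:{func}" for lib, funcs in imports.items() for func in funcs]

# 256 bins for the libraries followed by 1024 bins for the functions.
import_features = hash_to_bins(libraries, 256) + hash_to_bins(functions, 1024)
```

No matter how many functions a file imports, this part of the feature vector always has 1280 entries. The price is that two different names can collide in the same bin.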
Exported functions
As with the imported functions, the names of the exported functions are hashed, in this case into 128 bins.
#Feature in Feature Vector | Feature name | Description |
2224-2351 | Exported function features | Information about exported functions |
Data Directories
The last group includes the size and the virtual address of the first 15 data directories (export table, import table, resource table, etc.).
#Feature in Feature Vector | Feature name | Description |
2352-2381 | Data directories features | Information about the used data directories |
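The index ranges from the tables above can be rebuilt from the group sizes, which is handy when a feature index has to be mapped back to its group. The sizes below are derived from the ranges in the tables. The helper names are our own, and the indices are 1-based as in the tables, whereas an array in code would normally be 0-based.

```python
# Group sizes derived from the index ranges in the tables above.
GROUP_SIZES = [
    ("Byte histogram", 256),
    ("Byte-entropy histogram", 256),
    ("String information", 104),
    ("General file information", 10),
    ("Header information", 62),
    ("Section information", 255),
    ("Imported functions", 1280),
    ("Exported functions", 128),
    ("Data directories", 30),
]


def group_ranges(sizes):
    """Return {group name: (first index, last index)}, 1-based and inclusive."""
    ranges = {}
    start = 1
    for name, size in sizes:
        ranges[name] = (start, start + size - 1)
        start += size
    return ranges


def group_of(index, ranges):
    """Return the name of the group that contains the given 1-based index."""
    for name, (first, last) in ranges.items():
        if first <= index <= last:
            return name
    raise IndexError(f"feature index {index} is out of range")


RANGES = group_ranges(GROUP_SIZES)
TOTAL = sum(size for _, size in GROUP_SIZES)
```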
Feature importance
This overview above is all well and good, but now we have to find out which features are more important than others and if there are features that are not considered in our model at all.
To answer this question, we use the feature_importance function provided by the model itself. Since we work with a decision tree model, it is relatively easy to measure the “importance” of the features used. There are two ways to rank the features: split and gain.
Split
In the decision paths of a trained decision tree model, a feature can be queried several times. With feature_importance(importance_type='split'), we can determine the most frequently used features in our model:
Surprisingly, the timestamp is the most important feature in the EMBER model if split is used. The hashed entropy of the sections and the general file size also rank near the top.
Gain
Unlike split, feature_importance(importance_type='gain') does not count how often a feature is used. Instead, it adds up the gain of every split that uses the feature, so it highlights the features that influence the decision between benign and malicious the most, even if they are only queried a few times.
Looking at the results, besides the size, the virtual size, and the entropy of the file, the DLL flag in the COFF header is one of the most important features if gain is used. Whether an entry exists in the certificate table, i.e. whether the file is signed, seems to influence the classification between benign and malicious too.
If you would like to learn more about how the gain is calculated, you can find more information at https://xgboost.readthedocs.io/en/latest/tutorials/model.html
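The difference between the two rankings is easiest to see on a toy example. The list of splits below is invented and much smaller than a real model. With LightGBM itself, the same two rankings come from calling feature_importance on the trained booster with importance_type set to 'split' or 'gain'.

```python
from collections import defaultdict

# Hypothetical splits of a small model: (feature used, gain of that split).
SPLITS = [
    ("timestamp", 1.0), ("timestamp", 0.5), ("timestamp", 0.5),
    ("dll_flag", 6.0),
    ("file_size", 2.0), ("file_size", 1.5),
]


def importance(splits, importance_type="split"):
    """Rank features by the number of splits or by the summed gain."""
    if importance_type not in ("split", "gain"):
        raise ValueError("importance_type must be 'split' or 'gain'")
    scores = defaultdict(float)
    for feature, gain in splits:
        scores[feature] += 1.0 if importance_type == "split" else gain
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


by_split = importance(SPLITS, "split")
by_gain = importance(SPLITS, "gain")
```

In this toy data the timestamp is queried most often and leads the split ranking, while the DLL flag is used only once but contributes the largest gain. This mirrors why the two rankings of our model differ.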
Evasion
Equipped with the knowledge from the feature engineering analysis, we can once again take our ransomware sample from Part 1 and modify it in a targeted way to evade detection. As in Part 1, we start with an unmodified version of the Jigsaw ransomware, which our model classified as malicious with the following score:
Evasion | Score of Model |
none | 0.999094562 |
Our goal is to reduce the model score to a value below 0.5. To achieve this, we especially investigate the effects of changing the entropy, signature, DLL flag, and timestamp of our ransomware.
Entropy
Based on our results, the entropy of an input file (b'section_44.bin_entropy_hashed', b'256.bin_byte_entropy_histogram'…) seems to play a major role in the malware classification. Recall from Part 1 that the .netshrink modification even increased the return value of our model above the original start value:
Evasion | Score of Model |
.netshrink | 0.999518866 |
Since a compression algorithm usually increases the entropy of an input file, this could be a reason for the higher score. This leads to the assumption that our model interprets input files with high entropy as packed software (probably malware) and thus provides a higher value. Because it is quite difficult to reduce the entropy of a file, we will henceforth only make changes that do not increase the entropy significantly.
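The effect of compression on entropy can be checked with a few lines of code. The sketch below uses zlib from the standard library as a stand-in for a packer, which is an assumption for illustration only. A runtime packer such as .netshrink works differently in detail, but it also stores the payload in compressed form.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


# Low-entropy input: only digits and spaces occur.
plain = " ".join(str(i) for i in range(5000)).encode("ascii")
packed = zlib.compress(plain, 9)

plain_entropy = shannon_entropy(plain)
packed_entropy = shannon_entropy(packed)
```

The compressed data is smaller, but its bytes are distributed much more evenly, so its entropy per byte is clearly higher than that of the original input.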
Certificate
If we look at the results of split and gain, we see that b'dd_certificate_table_size' occurs in both tables. In other words, signing the file seemingly influences the classification of the input file. Unfortunately, we can’t sign our malware with Microsoft’s private key, but we can create our own certificate and sign our Jigsaw ransomware with it. Maybe our model only verifies if there is an entry in the Certificate Table or not, no matter which certificate is used.
Evasion | Score of Model |
self-signed | 0.966785917 |
The value is lower than at the beginning but still far away from our goal. Let’s move on.
DLL
According to our results, the DLL flag in the COFF header is one of the most important features. In Part 1, when we compiled our malware into a DLL, we got good results. Here we simply want to add the DLL flag to our already compiled ransomware. For this purpose, we work with PE Explorer. Although this small modification breaks our Jigsaw ransomware executable, it is still worth giving it a try.
Evasion | Score of Model |
DLL-flagged | 0.929116013 |
Apparently, setting only the DLL flag is not enough to push the value below 0.5. Since our flagged ransomware has no other characteristics of a DLL, it is possible that the DLL flag was never queried along the decision path.
Timestamp
According to the split results, the timestamp is the most used feature in our tree. In my opinion, two approaches make sense here:
- We set the timestamp of our malware to zero (i.e. 01.01.1970, when no malware existed).
- We select a benign sample from the training set and replace the timestamp in our Jigsaw ransomware with the timestamp of the benign sample.
Once again we use PE Explorer to set the timestamp of our malware:
Evasion | Score of Model |
zero timestamp | 0.017957451 |
benign timestamp | 0.095834957 |
The timestamp of a file is apparently more important than it seems at the beginning. With this small modification, we managed to reach our goal and get a value far below 0.5.
Conclusion
In this blog series, our goal was to bypass a model for detecting malware through machine learning by modifying the Jigsaw ransomware. Using the Black-Box approach and the Gray-Box approach, we simulated two different attack scenarios on the model, and we succeeded in both approaches. Through feature engineering analysis, we were able to identify a blind spot on the model and exploit it. By changing the timestamp of our malware, we were able to reduce the rating of our model to a benign value. Nevertheless, this blog, like other recently published work in this area, clearly shows that the current static malware detection using machine learning cannot provide reliable results and depends on the support of the dynamic approach. If you have questions about this work, feel free to ask them in the Comment section below.
References
1. “PE Format,” Microsoft, [Online]. Available: https://docs.microsoft.com/en-us/windows/win32/debug/pe-format. [Accessed 5 5 2020].
2. H. Anderson and P. Roth, “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models,” ArXiv e-prints, April 2018.
3. “LightGBM,” Microsoft, 2020. [Online]. Available: https://lightgbm.readthedocs.io/en/latest/index.html. [Accessed 19 5 2020].
4. “LIEF: library for instrumenting executable files,” Quarkslab, 2017-2018. [Online]. Available: https://lief.quarkslab.com/.
5. K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford and A. Smola, “Feature Hashing for Large Scale Multitask Learning,” ArXiv e-prints, 2009.
6. “PE Explorer,” Heaventools Software, 2020. [Online]. Available: http://www.pe-explorer.com/. [Accessed 20 5 2020].
7. “www.wikipedia.org,” 29 September 2019. [Online]. Available: https://en.wikipedia.org/wiki/Portable_Executable#/media/File:Portable_Executable_32_bit_Structure_in_SVG_fixed.svg. [Accessed 6 5 2020].
8. “www.wikipedia.org,” 29 September 2019. [Online]. Available: https://en.wikipedia.org/wiki/Decision_tree_learning#/media/File:Decision_Tree.jpg. [Accessed 6 5 2020].
9. “.netshrink,” PELock LLC, 2001 – 2020. [Online]. Available: https://www.pelock.com/products/netshrink. [Accessed 15 5 2020].