Compass Security Blog

Offensive Defense

Evading Static Machine Learning Malware Detection Models – Part 2: The Gray-Box Approach

In the first blog post of this series, we tested several tools for evading a static machine learning-based malware detection model. As promised, we are now taking a closer look at the EMBER dataset and feature engineering techniques for creating a detection model.

This blog series is based on my bachelor thesis, which I wrote in summer 2020 at ETH Zurich. In the first post, we saw how machine learning malware detection models work and how to evade them. We set up a test environment by training a model on the EMBER dataset and querying it with the Jigsaw ransomware. As you know from Part 1, the output of a trained malware detection model is a probability between 0 and 1, with values closer to 0 meaning benign and values closer to 1 meaning malicious. Our goal was to push the prediction score of our model below 0.5 by modifying our malware with commonly used evasion techniques. By carefully analyzing the results of our model, we achieved this goal with the help of some programming experience and a runtime packer. In the end, our “best” version even reduced the VirusTotal score from 64 to 14.

In Part 2 we will learn which features are used and how they are fed into a malware detection model to classify an input file. After a short theory part, we try to find out which features are especially important for malware analysis and how to modify them. Finally, we will change some of the features of our ransomware to evade our model. But before all this, it is advisable to get familiar with the file format used by our malware.

PE File Format

Since our Jigsaw ransomware (like most malware) targets Windows operating systems, it’s useful to take a look at the Portable Executable (PE) format. Similar to the ELF format used on Linux and most other Unix versions, 32-bit and 64-bit versions of Windows operating systems use the PE format for executable files. The PE format collects the information necessary for the Windows operating system to manage the packaged executable. Files with extensions such as .exe and .dll are famous representatives of this file format.

Structure of a Portable Executable 32 bit

Most of the headers shown in the image above are used by the dynamic linker and specify how the PE file should be mapped to memory at runtime. In the next section, we will see how these headers are used by machine learning models for detecting malware.

Features and Models

Generally speaking, a feature is a measurable property of an object. In the case of malware classification, for example, file size and timestamp are important features that can be used by a machine learning model. Features mostly appear in the form of numbers, but letters or even whole vectors can represent features. In order to feed all important features into a machine learning model, they are put into a single vector, also called a feature vector. But how do you get from a feature vector to a result like malicious or benign?

For this purpose, machine learning models are used, which analyze the feature vector and draw conclusions based on the information obtained. There are different types of models, such as neural networks or decision trees. Our malware detection model uses LightGBM, a framework that combines many gradient-boosted decision trees into one predictive model, to go from the input file to its result.

Decision tree calculating the chance of survival of a Titanic passenger

As in the figure above, the features (i.e. gender, age, etc.) of a feature vector are examined and analyzed in a decision tree model to obtain a result. The bigger the tree the more likely it is that a feature will be queried several times. The path from the root to the result node is called the decision path.
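To make the notion of a decision path concrete, here is a minimal sketch of a hand-built tree in Python. The node layout, the feature names, and the thresholds are illustrative assumptions loosely modeled on the Titanic example; they are not taken from the figure or from our model.

```python
# Toy decision tree; structure and thresholds are hypothetical.
# Inner node: (feature, threshold, subtree_if_below, subtree_otherwise).
# Leaf: a plain string label.
TREE = ("is_male", 0.5,
        "survived",
        ("age", 9.5,
         ("siblings", 2.5, "survived", "died"),
         "died"))


def predict(tree, features):
    """Walk from the root to a leaf and record every feature queried."""
    path = []
    node = tree
    while not isinstance(node, str):
        name, threshold, below, otherwise = node
        path.append(name)
        node = below if features[name] < threshold else otherwise
    return node, path


label, path = predict(TREE, {"is_male": 1, "age": 30, "siblings": 0})
print(label, path)  # died ['is_male', 'age']
```

The returned list is the decision path: only the features on it had any influence on this particular result.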

Now that we know what a feature is and how decision trees work, it is time to take a closer look at the features used by our model.

Feature engineering

The following section gives a high-level overview of the features that are contained in our feature vector. In EMBER, the LIEF project is used to parse an input file and extract features from it. This process generates a feature vector of length 2381 that can be fed into the model. Since we cannot describe all those features individually, we categorize them into nine groups of similar features, each of which is briefly described below.

Byte Histogram

The first group is a histogram of 256 values, one per possible byte value. Each value is the count of that byte in the file, normalized so that the histogram describes a distribution.

# Feature in Feature Vector | Feature name | Description
1-256 | Byte histogram | Byte histogram of the PE file
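A byte histogram of this kind takes only a few lines. The following is a sketch in plain Python, not the EMBER extractor itself; the function name is our own.

```python
def byte_histogram(data: bytes) -> list:
    """Return 256 relative frequencies, one per possible byte value."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data)
    if total == 0:
        return [0.0] * 256
    return [c / total for c in counts]


hist = byte_histogram(b"\x00\x00\x00\xff")
print(hist[0x00], hist[0xFF])  # 0.75 0.25
```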

Byte-Entropy Histogram

The second group is a histogram that approximates how entropy is distributed over the whole input file.

# Feature in Feature Vector | Feature name | Description
257-512 | Byte entropy histogram | Byte-entropy histogram of the PE file

This part is for math lovers and can be skipped as it is not necessary for understanding this blog. It describes the process of creating the byte entropy histogram:

This histogram simulates the joint distribution of entropy and byte value over the whole input file. This can be achieved by pairing the computed scalar entropy for a fixed-length window with each byte occurrence in the corresponding window. This process is repeated while the window passes through all input bytes. The EMBER implementation uses a window size of 2048 bytes and a step size of 1024 bytes. In the last step, the entropy and byte value are quantized into 16×16 bins. These 16×16 values are concatenated to a vector of length 256.

H. Anderson and P. Roth, “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models,” ArXiv e-prints, April 2018
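For readers who prefer code to prose, the procedure can be sketched as follows. This is our own simplified reading of the description above. The coarse 16-bin entropy estimate, the scaling factor, and the handling of files shorter than one window are assumptions of this sketch, and the EMBER reference implementation may differ in detail.

```python
import math


def byte_entropy_histogram(data: bytes, window: int = 2048, step: int = 1024) -> list:
    """Sketch of a 16x16 joint histogram of window entropy and byte value."""
    hist = [[0] * 16 for _ in range(16)]
    # Slide the window over the file; a short file is treated as one block.
    last_start = max(len(data) - window, 0)
    for start in range(0, last_start + 1, step):
        block = data[start:start + window]
        if not block:
            break
        # Coarse byte histogram: 16 bins of 16 byte values each.
        coarse = [0] * 16
        for b in block:
            coarse[b >> 4] += 1
        # Entropy of the coarse distribution (0..4 bits), scaled to 0..8 bits.
        entropy = 0.0
        for c in coarse:
            if c:
                p = c / len(block)
                entropy -= p * math.log2(p)
        entropy *= 2
        # Quantize the entropy into 16 bins and pair it with each byte seen.
        row = min(int(entropy * 2), 15)
        for col, c in enumerate(coarse):
            hist[row][col] += c
    # Concatenate the 16x16 bins into one vector and normalize it.
    flat = [v for r in hist for v in r]
    total = sum(flat)
    return [v / total for v in flat] if total else flat
```

A file consisting only of zero bytes ends up entirely in the first bin (lowest entropy, lowest byte values), while uniformly distributed bytes fill the last row of the 16×16 grid.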

String information

The next group contains simple statistics about printable strings that are at least five printable characters long, including a histogram of printable characters. For example, the number of strings that begin with C:\ (file system path) or the number of occurrences of HKEY_ (registry key) belong to this group.

# Feature in Feature Vector | Feature name | Description
513-616 | String information features | Information about the used strings
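A few of these statistics can be sketched with regular expressions. The exact patterns are our assumption and the dictionary keys are made up for illustration. Note that this sketch counts every occurrence of C:\ rather than only strings that begin with it.

```python
import re

# Runs of at least five printable ASCII characters (0x20-0x7e).
PRINTABLE = re.compile(rb"[\x20-\x7e]{5,}")


def string_stats(data: bytes) -> dict:
    """Count printable strings plus two of the indicators named in the text."""
    strings = PRINTABLE.findall(data)
    return {
        "num_strings": len(strings),
        "avg_length": sum(map(len, strings)) / len(strings) if strings else 0.0,
        "paths": len(re.findall(rb"c:\\", data, re.IGNORECASE)),
        "registry": len(re.findall(rb"HKEY_", data)),
    }


sample = b"\x00\x01C:\\Windows\\notepad.exe\x00abc\x00HKEY_LOCAL_MACHINE\\Software\x00"
print(string_stats(sample))
```

The three-character run "abc" in the sample is ignored because it is shorter than five characters.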

General file information

This group represents basic information about the executable (e.g. file size and the number of symbols).

# Feature in Feature Vector | Feature name | Description
617-626 | General file information features | General information about the input file

Header information

Information about the COFF header and the Optional Header belongs to this group (e.g. the timestamp).

# Feature in Feature Vector | Feature name | Description
627-688 | Header information features | Information about COFF and the Optional Header

Section Information

Properties like entropy, virtual size, and the entry name of each section are in this group.

# Feature in Feature Vector | Feature name | Description
689-943 | Section information features | Properties of the used sections

Imported functions

This group contains a set of unique libraries hashed to 256 bins, and a set of individual imported functions hashed to 1024 bins.

# Feature in Feature Vector | Feature name | Description
944-2223 | Imported function features | Information about imported functions and the imported libraries
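Hashing a variable-size set of names into a fixed number of bins is known as the hashing trick. The sketch below shows the idea. SHA-256 is only a stand-in for whatever hash function the real extractor uses, and the "library:function" token format and the sample imports are assumptions for illustration.

```python
import hashlib


def hash_to_bins(tokens, n_bins: int) -> list:
    """Hashing trick: map an arbitrary set of strings to a fixed-length vector."""
    vector = [0] * n_bins
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_bins
        vector[index] += 1
    return vector


libraries = ["kernel32.dll", "user32.dll", "advapi32.dll"]
functions = ["kernel32.dll:CreateFileW", "advapi32.dll:CryptEncrypt"]
features = hash_to_bins(libraries, 256) + hash_to_bins(functions, 1024)
print(len(features))  # 1280
```

The vector length stays at 256 + 1024 = 1280 no matter how many libraries or functions a file imports, which is what lets these features occupy a fixed range of the feature vector.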

Exported functions

Like the imported functions, the exported functions are hashed to 128 bins.

# Feature in Feature Vector | Feature name | Description
2224-2351 | Exported function features | Information about exported functions

Data Directories

The last group includes the size and the virtual address of the first 15 data directories (export table, import table, resource table, etc.).

# Feature in Feature Vector | Feature name | Description
2352-2381 | Data directories features | Information about the used data directories
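The index ranges in the tables above follow directly from concatenating the nine groups. As a quick sanity check, the short sketch below recomputes them from the group sizes, which we derived from those ranges, and confirms the total length of 2381.

```python
# Group sizes derived from the index ranges listed in the tables above.
GROUPS = [
    ("Byte histogram", 256),
    ("Byte-entropy histogram", 256),
    ("String information", 104),
    ("General file information", 10),
    ("Header information", 62),
    ("Section information", 255),
    ("Imported functions", 1280),
    ("Exported functions", 128),
    ("Data directories", 30),
]


def index_ranges(groups):
    """Return (name, first, last) with 1-based inclusive positions."""
    ranges, start = [], 1
    for name, size in groups:
        ranges.append((name, start, start + size - 1))
        start += size
    return ranges


for name, first, last in index_ranges(GROUPS):
    print(f"{first:>4}-{last:<4} {name}")
```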

Feature importance

The overview above is all well and good, but now we have to find out which features are more important than others and whether there are features that our model does not consider at all.

To answer this question, we use the feature_importance function provided by the model itself. Since we work with a decision tree model, it is relatively easy to measure the “importance” of the features used. Two measures are available for determining the most important features: split and gain.


In the decision paths of a trained decision tree model, a feature can be queried several times. Calling feature_importance with importance_type='split' counts how often each feature is used for a split and thus gives us the most used features in our model:

Most important features with importance_type='split'

Surprisingly, the timestamp is the most important feature in the EMBER model if split is used. The hashed entropy of the sections and the general file size are also placed high up.


Unlike split, importance_type='gain' does not count how often a feature is used. Instead, it adds up the gain of all splits that use the feature, which shows the features with the most influence on the decision between benign and malicious:

Most important features with importance_type='gain'

Looking at the results for gain, the DLL flag in the COFF header is one of the most important features, besides the size, the virtual size, and the entropy of the file. Whether an entry exists in the certificate table, i.e. whether the file is signed, seems to influence the classification between benign and malicious too.
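The difference between the two measures is easiest to see on a toy example. The split records below are invented for illustration and do not come from our model. In LightGBM's Python API, the corresponding call is Booster.feature_importance with importance_type set to "split" or "gain".

```python
from collections import defaultdict

# Hypothetical split records (feature name, gain) collected from all trees.
SPLITS = [
    ("timestamp", 1.0), ("timestamp", 0.5), ("timestamp", 0.5),
    ("dll_flag", 6.0),
    ("file_size", 2.0), ("file_size", 1.0),
]


def importance(splits, importance_type="split"):
    """Rank features by how often they split ('split') or by summed gain ('gain')."""
    scores = defaultdict(float)
    for feature, gain in splits:
        scores[feature] += 1.0 if importance_type == "split" else gain
    return sorted(scores, key=scores.get, reverse=True)


print(importance(SPLITS, "split"))  # ['timestamp', 'file_size', 'dll_flag']
print(importance(SPLITS, "gain"))   # ['dll_flag', 'file_size', 'timestamp']
```

A feature that is queried often but contributes little each time leads the split ranking, whereas a feature that is queried once with a large effect leads the gain ranking.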

If you would like to learn more about how the gain is calculated, you can find more information at


Equipped with the knowledge from the feature engineering analysis, we can once again take our ransomware sample from Part 1 and precisely modify it for evading detection. As in part one, we start with an unmodified Jigsaw ransomware version, which was classified as malicious by our model with a score:

Evasion | Score of Model

Our goal is to reduce the model score to a value below 0.5. To achieve this, we especially investigate the effects of changing the entropy, signature, DLL flag, and timestamp of our ransomware.


Based on our results, the entropy of an input file (b'section_44.bin_entropy_hashed', b'256.bin_byte_entropy_histogram', …) seems to play a major role in the malware classification. Recall from Part 1 that the .netshrink modification even raised the score of our model above the original value:

Evasion | Score of Model

Since a compression algorithm usually increases the entropy of an input file, this could be a reason for the higher score. This leads to the assumption that our model interprets input files with high entropy as packed software (probably malware) and thus assigns them a higher score. Because it is quite difficult to reduce the entropy of a file, we will henceforth only make changes that do not increase the entropy significantly.
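The effect of compression on entropy is easy to reproduce. The sketch below computes the Shannon entropy in bits per byte and compares a block of text-like data with its zlib-compressed form. The generated text is a hypothetical stand-in for an unpacked file, and zlib stands in for a real packer.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, at most 8.0."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in Counter(data).values())


# Hypothetical stand-in for an unpacked file: pseudo-random word sequence.
rng = random.Random(0)
words = [b"open", b"file", b"read", b"write", b"close", b"key", b"user", b"data"]
plain = b" ".join(rng.choice(words) for _ in range(5000))
packed = zlib.compress(plain)

print(round(shannon_entropy(plain), 2), round(shannon_entropy(packed), 2))
```

The compressed data comes out much closer to the maximum of 8 bits per byte than the original text, which matches the behavior we suspect the model picks up on.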


If we look at the results of split and gain, we see that b'dd_certificate_table_size' occurs in both tables. In other words, signing the file seemingly influences the classification of the input file. Unfortunately, we can’t sign our malware with Microsoft’s private key, but we can create our own certificate and sign our Jigsaw ransomware with it. Maybe our model only checks whether there is an entry in the Certificate Table, no matter which certificate is used.

Evasion | Score of Model

The value is lower than at the beginning but still far away from our goal. Let’s move on.
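What a static feature such as dd_certificate_table_size can see is only the entry in the data directory, not whether the signature is trustworthy. The sketch below reads that entry straight from the PE headers with the standard library. The function names are our own, and the code is a simplified reader without bounds checking.

```python
import struct

# Offset of the data directories inside the Optional Header, by magic value.
DIRECTORY_OFFSET = {0x10B: 96, 0x20B: 112}  # PE32, PE32+
CERTIFICATE_TABLE = 4  # index of the Certificate Table entry


def certificate_table_entry(pe: bytes):
    """Return (file_offset, size) stored in the Certificate Table entry."""
    (e_lfanew,) = struct.unpack_from("<I", pe, 0x3C)
    if pe[:2] != b"MZ" or pe[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file")
    optional = e_lfanew + 24  # skip 4-byte signature and 20-byte COFF header
    (magic,) = struct.unpack_from("<H", pe, optional)
    if magic not in DIRECTORY_OFFSET:
        raise ValueError("unknown Optional Header magic")
    entry = optional + DIRECTORY_OFFSET[magic] + CERTIFICATE_TABLE * 8
    return struct.unpack_from("<II", pe, entry)


def has_certificate_entry(pe: bytes) -> bool:
    """True if the entry is populated; the signature itself is not verified."""
    offset, size = certificate_table_entry(pe)
    return offset != 0 and size != 0
```

A self-signed file and a file signed with a trusted certificate both return True here, which is why a self-made certificate is enough to change this particular feature.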


According to our results, the DLL flag in the COFF header is one of the most important features. In Part 1, we got good results when we compiled our malware into a DLL. Here we simply want to add the DLL flag to our already compiled ransomware. For this purpose, we use PE Explorer. Although this small modification breaks our Jigsaw ransomware executable, it is still worth a try.

Evasion | Score of Model

Apparently, setting the DLL flag alone is not enough to push the value below 0.5. Since our flagged ransomware has no other characteristics of a DLL, it is possible that the DLL flag was never queried on its decision path.
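We made this change with PE Explorer, but the flag is a single bit and can also be set programmatically. The following sketch is an alternative under that assumption, not the tool's method. It sets the IMAGE_FILE_DLL bit (0x2000) in the Characteristics field of the COFF header. The function name is our own.

```python
import struct

IMAGE_FILE_DLL = 0x2000  # COFF Characteristics bit marking a DLL


def set_dll_flag(pe: bytes) -> bytes:
    """Return a copy of the PE image with IMAGE_FILE_DLL set."""
    patched = bytearray(pe)
    (e_lfanew,) = struct.unpack_from("<I", patched, 0x3C)
    if patched[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file")
    # Characteristics is the last 2-byte field of the 20-byte COFF header.
    offset = e_lfanew + 4 + 18
    (flags,) = struct.unpack_from("<H", patched, offset)
    struct.pack_into("<H", patched, offset, flags | IMAGE_FILE_DLL)
    return bytes(patched)
```

The function leaves every other byte untouched, so the result has the flag of a DLL but none of its other characteristics.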


According to the split results, the timestamp is the most used feature in our tree. In my opinion, two approaches make sense here:

  • We set the timestamp of our malware to zero (i.e. 01.01.1970, when no malware existed).
  • We select a benign sample from the training set and replace the timestamp in our Jigsaw ransomware with the timestamp of the benign sample.

Once again we use PE Explorer to set the timestamp of our malware:

Evasion | Score of Model
zero timestamp | 0.017957451
benign timestamp | 0.095834957

The timestamp of a file is apparently more important than it seemed at first. With this small modification, we reached our goal and got a value far below 0.5.
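For completeness, the same edit can be made without a GUI tool. The sketch below overwrites the TimeDateStamp field of the COFF header, which sits eight bytes after the start of the PE signature. It is an alternative to the PE Explorer workflow we used, and the function name is our own.

```python
import struct


def set_timestamp(pe: bytes, timestamp: int) -> bytes:
    """Return a copy of the PE image with a new COFF TimeDateStamp."""
    patched = bytearray(pe)
    (e_lfanew,) = struct.unpack_from("<I", patched, 0x3C)
    if patched[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file")
    # TimeDateStamp follows the 4-byte signature, Machine (2 bytes)
    # and NumberOfSections (2 bytes).
    struct.pack_into("<I", patched, e_lfanew + 8, timestamp)
    return bytes(patched)
```

Calling set_timestamp(data, 0) corresponds to the “zero timestamp” variant. For the “benign timestamp” variant, pass the value read from the same offset of a benign sample.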


In this blog series, our goal was to bypass a machine learning model for detecting malware by modifying the Jigsaw ransomware. Using the Black-Box approach and the Gray-Box approach, we simulated two different attack scenarios on the model, and we succeeded with both. Through feature engineering analysis, we were able to identify a blind spot in the model and exploit it: by changing the timestamp of our malware, we reduced the score of our model to a benign value. Like other recently published work in this area, this blog shows that current static malware detection using machine learning cannot provide reliable results on its own and depends on the support of a dynamic approach. If you have questions about this work, feel free to ask them in the comment section below.


1. “PE Format,” Microsoft, [Online]. Available: [Accessed 5 5 2020].

2. H. Anderson and P. Roth, “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models,” ArXiv e-prints, April 2018.

3. “LightGBM,” Microsoft, 2020. [Online]. Available: [Accessed 19 5 2020].

4. “LIEF: library for instrumenting executable files,” Quarkslab, 2017-2018. [Online]. Available:

5. K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford and A. Smola, “Feature Hashing for Large Scale Multitask Learning,” ArXiv e-prints, 2009.

6. “PE Explorer,” Heaventools Software, 2020. [Online]. Available: [Accessed 20 5 2020].

7. “,” 29 September 2019. [Online]. Available: [Accessed 6 5 2020].

8. “,” 29 September 2019. [Online]. Available: https:// [Accessed 6 5 2020].

9. “.netshrink,” PELock LLC, 2001 – 2020. [Online]. Available: [Accessed 15 5 2020].


  1. K

    Nice work. I’m really enjoying this series.

    Did you check the results of VirusTotal with this new approach?

    • Adrian Kress

      Hi K

      Thanks a lot for your feedback, I appreciate it.

      I uploaded the “zero timestamp” and the “benign timestamp” on VT, both reached a value of 45. This is somehow astonishing, since we actually only concentrated on the weak points in our model. But apparently the small modification of the timestamps was enough to bypass about 20 other malware detection engines.

  2. Indiana Jones

    You might like to change two lt () in your explanation figure of the decision tree.
    Interesting work. Thanks!

    • Adrian Kress


      Thank you very much for your feedback. I think for the sake of clarity I will leave it as it is.
