Improving Malware Detection Accuracy by Extracting Icon Information

At the 2018 IEEE International Conference on Multimedia Information Processing and Retrieval, we published a paper titled Improving Malware Detection Accuracy by Extracting Icon Information, wherein we described a novel technique for classifying PE malware using icon-based features.

The intuition is that the icons embedded in Portable Executable (PE) malware tend to be different from the icons you would likely find in benign files. What we’ve found is that by extracting icons from PEs and using some traditional computer vision techniques to generate additional features from those icons, we can noticeably improve classifier performance.

In this blog, we will explain our approach for extracting features from icons embedded in PEs and show that these icon features improve classifier performance in a simple example with 2,276 publicly available PE files (50% malware and 50% benign).

Feature Engineering

While the most straightforward features from each icon are the raw RGB pixel intensities, these features are either too noisy or too numerous to be effective in our malware classifier. Instead, we use three types of features generated from the original icon pixels:

  • Summary statistics features
  • Histogram of oriented gradient (HOG) features
  • Autoencoder (AE) features

Summary Statistics Features

Raw pixel intensities (i.e., RGB values) tend to be noisy, but by generating summary statistics (mean and standard deviation) features, we can still capture the information in the raw pixels while being less affected by the noise. In particular, we compute the mean and standard deviation of the pixels in the following three cases (a short NumPy sketch follows Figure 1):

  • on all the pixels and across all three channels (Red, Green, and Blue)
  • on each channel separately
  • by splitting the icon image into 9 different sections (Figure 1) and computing summary statistics per section and across the three channels


Figure 1. We split the image into 9 sections and compute statistics for each section.
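For concreteness, here is a minimal NumPy sketch of these summary-statistics features. The 3x3 grid split and the array layout are our assumptions for illustration, not the exact implementation from the paper.

```python
import numpy as np

def summary_stat_features(icon):
    """icon: H x W x 3 uint8 RGB array; returns a 1-D feature vector."""
    icon = icon.astype(np.float32)
    feats = []

    # (1) mean/std over all pixels and all three channels
    feats += [icon.mean(), icon.std()]

    # (2) mean/std per channel (R, G, B)
    for c in range(3):
        feats += [icon[:, :, c].mean(), icon[:, :, c].std()]

    # (3) split the image into a 3x3 grid (9 sections) and compute
    #     mean/std per section, across all three channels
    h, w = icon.shape[:2]
    for rows in np.array_split(np.arange(h), 3):
        for cols in np.array_split(np.arange(w), 3):
            patch = icon[np.ix_(rows, cols)]
            feats += [patch.mean(), patch.std()]

    return np.array(feats)
```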

Histogram of Oriented Gradients (HOG) Features

Histogram of Oriented Gradients (HOG) is a traditional computer vision technique for extracting features from an image. The idea is to split an image into many small patches and then count how often edges with different orientations appear in those patches.

The utility of edge-based features like HOG has been known for a long time. In fact, modern convolutional neural networks seem to learn edge-based representations for image classification tasks automatically, and there’s even evidence that mammalian vision systems contain cells that respond to edges.

HOG features capture contour, silhouette, and some texture information while providing robustness to variations in illumination and color (e.g., when malware authors take an existing icon and modify it slightly). To generate the HOG features, we slide a small window over the image and compute the image gradients within that window [Walt et al].
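A hedged sketch of HOG extraction using scikit-image is shown below; the orientation, cell, and block settings are illustrative choices, not the values used in the paper.

```python
from skimage.color import rgb2gray
from skimage.feature import hog

def hog_features(icon_rgb):
    """icon_rgb: H x W x 3 array; returns a 1-D HOG feature vector."""
    gray = rgb2gray(icon_rgb)
    return hog(
        gray,
        orientations=9,          # number of gradient-orientation bins
        pixels_per_cell=(8, 8),  # small patches over which histograms are built
        cells_per_block=(2, 2),  # local blocks used for contrast normalization
        block_norm="L2-Hys",
    )
```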

Autoencoder (AE) Features

Instead of hand-engineering features as in the two previous subsections, we can let a neural network learn important features itself. In particular, we use a convolutional autoencoder neural network [Y. Bengio] to learn a more condensed set of features from icons.

To reconstruct an icon from its compressed representation, the network must learn which features of each icon image are most important to retain so that it can recreate the original image reasonably well. We trained the AE on a large set of icons to make it more robust and generalizable [K. He, et al].
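Below is a minimal convolutional autoencoder sketch in Keras. The 32x32x3 input size, layer widths, and bottleneck dimension are assumptions for illustration; the architecture in the paper may differ.

```python
from tensorflow.keras import layers, models

def build_icon_autoencoder(bottleneck_dim=64):
    inp = layers.Input(shape=(32, 32, 3))

    # Encoder: compress the icon into a small bottleneck vector
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)                        # 16x16
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)                        # 8x8
    x = layers.Flatten()(x)
    code = layers.Dense(bottleneck_dim, name="code")(x)  # AE features

    # Decoder: reconstruct the original icon from the bottleneck
    x = layers.Dense(8 * 8 * 32, activation="relu")(code)
    x = layers.Reshape((8, 8, 32))(x)
    x = layers.UpSampling2D(2)(x)                        # 16x16
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
    x = layers.UpSampling2D(2)(x)                        # 32x32
    out = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(x)

    autoencoder = models.Model(inp, out)
    encoder = models.Model(inp, code)  # used to extract the AE features
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder
```

After training the autoencoder on reconstruction loss, only the encoder half is kept; its bottleneck output is used as the icon's AE feature vector.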

Icon Clustering

With the summary statistics, HOG, and autoencoder features combined, we extract a total of 1,114 features from each icon. While we could feed all of these features into a classifier, one could also go a step further and reduce the dimensionality of the data using techniques such as random projections, principal component analysis, or t-SNE (Figure 2), cutting the number of features below 1,114 while still retaining valuable information.

We, however, take a different approach: we use an unsupervised ML model to cluster icon images based on their 1,114 engineered features. Clustering both borrows information from neighboring icons and compresses everything gathered from an icon image into a single cluster ID:


Figure 2. Visualizing some icons with t-SNE.

To cluster the icon images based on the 1,114 extracted features, we chose hierarchical density-based spatial clustering of applications with noise (HDBSCAN). The algorithm's ability to place outliers in a separate cluster is an attractive property for our purpose.

Another important advantage is that the algorithm learns the number of clusters itself. We used the silhouette score to measure the quality of the detected clusters. We then go a step further and run an additional K-Means pass on the outlier class to make sure that all the icons are properly clustered.
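The sketch below shows one way to implement this step, assuming the publicly available `hdbscan` and scikit-learn packages; the parameter values (e.g., `min_cluster_size`, the number of K-Means clusters for the outlier class) are illustrative.

```python
import numpy as np
import hdbscan
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_icons(icon_features, outlier_k=10):
    """icon_features: n_icons x 1114 array of engineered icon features."""
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
    labels = clusterer.fit_predict(icon_features)   # -1 marks the outlier class

    # Measure cluster quality on the points HDBSCAN actually clustered
    mask = labels != -1
    if mask.sum() > 1 and len(set(labels[mask])) > 1:
        print("silhouette:", silhouette_score(icon_features[mask], labels[mask]))

    # Re-cluster the outlier class with K-Means so every icon gets a cluster ID
    # (assumes the outlier class has at least `outlier_k` members)
    if (~mask).any():
        km = KMeans(n_clusters=outlier_k, random_state=0)
        outlier_labels = km.fit_predict(icon_features[~mask])
        labels[~mask] = labels.max() + 1 + outlier_labels

    return labels
```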

One downside of HDBSCAN is that it does not provide a prediction function for a new, never-seen-before sample. Our procedure for predicting a cluster ID for such a sample is as follows:

  • Generate the 1,114 engineered features for the new sample
  • Use the K-Nearest Neighbors algorithm, trained on the samples already clustered with HDBSCAN, to predict the cluster ID for the new sample
  • If K-NN assigns the sample to the outlier cluster, resolve it with the second K-Means step (a short sketch follows this list)
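A hedged sketch of the prediction step using scikit-learn's K-Nearest Neighbors classifier is given below. Because the outlier class was already re-clustered with K-Means in the previous sketch, the K-Means fallback from the last bullet is folded into the training labels here.

```python
from sklearn.neighbors import KNeighborsClassifier

def fit_cluster_predictor(train_features, train_cluster_ids, k=5):
    """Fit K-NN on icons already clustered (HDBSCAN + K-Means on outliers)."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_features, train_cluster_ids)
    return knn

def predict_cluster_id(knn, new_icon_features):
    """new_icon_features: 1 x 1114 engineered-feature vector for a new sample."""
    return knn.predict(new_icon_features)[0]
```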

Testing the Effectiveness of the Icon Features

To test the efficacy of our proposed method in improving malware prediction, we use a balanced sample of publicly available PE files obtained from VirusTotal: 1,138 benign and 1,138 malware files. To visualize the icons used in the experiment (Figure 2), we apply t-SNE [L. van der Maaten, et al] to the raw icon pixels.

Figure 2, produced with t-SNE, shows that malware and benign icons are well mixed; nevertheless, we will show in this section that our approach is still capable of using the information in the icons to better predict malware.

Using our proposed method, we first generate the 1,114 icon features and then cluster the icons. In addition, for each PE file we use the publicly available Python package pefile to generate "entropy", "Misc VirtualSize", and "SizeOfRawData" features from the ".text", ".data", and ".rsrc" sections; we refer to these as structure features.
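A minimal sketch of extracting these structure features with pefile is shown below; the attribute names follow pefile's section fields, and the dictionary layout is our own choice for illustration.

```python
import pefile

SECTIONS = (b".text", b".data", b".rsrc")

def structure_features(path):
    """Extract per-section entropy, virtual size, and raw data size."""
    pe = pefile.PE(path)
    feats = {}
    for section in pe.sections:
        name = section.Name.rstrip(b"\x00")   # section names are null-padded
        if name in SECTIONS:
            key = name.decode()
            feats[f"{key}_entropy"] = section.get_entropy()
            feats[f"{key}_Misc_VirtualSize"] = section.Misc_VirtualSize
            feats[f"{key}_SizeOfRawData"] = section.SizeOfRawData
    return feats
```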

In order to test the effectiveness of icon clusters generated using our proposed method in detecting malware, we build the following three prediction models:

  • Lasso Logistic Regression (L1)
  • Ridge Logistic Regression (L2)
  • Linear Support Vector Machine (SVM)

Each model is then fit twice: once using only the structure features, and once using both the structure features and the one-hot encoded icon cluster feature [analysis code].

To better estimate out-of-sample accuracy, the original data is randomly split into a training set (80% of the data) and a test set (20% of the data). The split uses stratified sampling, which guarantees balanced labels in both the training and test data.

The test data remain untouched during model fitting and are used solely for the final out-of-sample accuracy evaluation. To avoid overfitting, all of our proposed models are regularized (either L1 or L2), and the regularization parameters are tuned using stratified 4-fold cross-validation.
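The sketch below illustrates this evaluation setup with scikit-learn: a stratified 80/20 split, a one-hot encoded icon cluster ID, and stratified 4-fold cross-validation to tune the regularization strength. The parameter grid and the `evaluate` helper are our assumptions; the structure-only baseline is the same code without the one-hot columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder

def evaluate(structure_X, cluster_ids, y):
    """structure_X: n x d structure features; cluster_ids: n icon cluster IDs."""
    # One-hot encode the single icon cluster ID feature and append it
    onehot = OneHotEncoder(handle_unknown="ignore")
    cluster_X = onehot.fit_transform(cluster_ids.reshape(-1, 1)).toarray()
    X = np.hstack([structure_X, cluster_X])

    # Stratified 80/20 split keeps malware/benign labels balanced in both sets
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    models = {
        "lasso_lr": (LogisticRegression(penalty="l1", solver="liblinear"),
                     {"C": np.logspace(-3, 3, 13)}),
        "ridge_lr": (LogisticRegression(penalty="l2", solver="liblinear"),
                     {"C": np.logspace(-3, 3, 13)}),
        "linear_svm": (LinearSVC(), {"C": np.logspace(-3, 3, 13)}),
    }
    cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    for name, (est, grid) in models.items():
        search = GridSearchCV(est, grid, cv=cv).fit(X_tr, y_tr)
        print(name, "test accuracy:", search.score(X_te, y_te))
```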

As an example, Figure 3 shows results from the optimization process of the regularization parameter of the lasso logistic regression model using stratified 4-fold cross-validation:


Figure 3. Results from the optimization process of the regularization parameter, α, using stratified 4-fold cross validation for the lasso logistic regression model.

Table 1 shows the results of fitting the three models, both with and without the icon cluster feature. The table reports the regularization parameter, α, optimized using stratified 4-fold cross-validation.

The table also includes measures of accuracy at the optimized α value: cross-validation accuracy, true positive rate, and true negative rate, each with its standard error. Finally, the fitted models are evaluated on the test data, and we report test accuracy, test true positive rate, test true negative rate, and the area under the ROC curve:


As one can see, adding the icon cluster feature to the feature set consistently boosts accuracy and area under the curve across all three models. This is a clear indication that our proposed icon clustering technique can help improve malware detection in PE files. Figure 4 also supports this conclusion.

In this figure, we show the ROC curves for our three candidate models. The black curve corresponds to the model with the icon cluster feature, and the red curve corresponds to the model without it:


Figure 4. ROC curves for the candidate models. The red line corresponds to a model using only the pefile features. The black line corresponds to a model using the icon cluster ID feature in addition to the pefile features.

Conclusion

This blog summarizes a research paper in which we claimed that file icons can help improve the accuracy of ML models for detecting malware. We proposed a new approach for extracting information from file icons, and further proposed an unsupervised approach to reduce the engineered features to a single icon cluster ID. By running experiments on ~2,200 publicly available PE files, we showed that using icon information can significantly improve the accuracy of a malware classifier.

Additional Reference:

https://www.computer.org/csdl/proceedings/mipr/2018/1857/00/185701a408.pdf

About Sepehr Akhavan Masouleh

Senior Data Scientist at BlackBerry Cylance

Sepehr Akhavan Masouleh is a Senior Data Scientist at Cylance Inc., where he works on the application of Machine Learning, Deep Learning, and Statistical Modeling in cybersecurity. He holds a PhD in Statistics from the University of California, Irvine, with a dissertation focused on Probabilistic Learning, Bayesian Inference, and Bayesian Non-Parametrics.

Sepehr is also passionate about teaching and has served as a Machine Learning instructor at UC Irvine.