Random Forest Classifier On Malware
(Copyright 2020 by YI SHA. If you want to re-post this, please send me an email: shayi1983end@gmail.com)
(Full English version) A random forest classifier algorithm for handling malware (Random Forest Classifier On Malware)
Overview
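The random forest classifier is a recently popular machine learning algorithm for identifying malware, implemented here in the Python programming language. The traditional signature-based and heuristic detection used by anti-virus software can no longer fully detect the large number of variants, so an efficient and accurate method is needed. Fortunately, we have the open-source sklearn library to draw on: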
In this tutorial, I'll show you how to use the random forest classifier, a machine learning algorithm, to detect malware with the Python programming language.
The traditional, now obsolete signature-based and heuristic approaches used by the majority of anti-virus software are no longer suitable for detecting the huge number of malware variants that emerge nowadays. For these billions of variants, we need a fast, automatic, and accurate way to judge whether an unknown software binary is malicious or benign.
The Python sklearn library provides a RandomForestClassifier class that does this job very well. Note that the simplest way to use the random forest algorithm is in a binary (two-class) scenario: classifying an unknown object into one of two possible categories. This means that any task involving binary classification, not merely telling malware from benign ware, can take advantage of a random forest classifier.
So let's get into our topic. From a high-level perspective, I'll extract every printable string longer than five characters from the two training datasets, malware and benign ware, respectively. I then compress this data using the hashing trick to save memory and speed up the analysis. Next, I use this data, along with a label vector, to train our random forest classifier model so that it forms a general concept of what malware and benign ware look like. Finally, I pass a previously unseen Windows PE binary file to this classifier and let it make a prediction. The resulting value is the probability that the file is malicious, which can be fed to the logic of other components inside an anti-virus product.
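As a rough sketch of the whole workflow just described (not the exact code of this tutorial), the following Python snippet trains on a few made-up string sets instead of real binaries. The sample strings, n_features=2000, n_estimators=64, and the helper name to_vector are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

# Hypothetical string sets standing in for strings extracted from real binaries.
malware_samples = [
    ["CreateRemoteThread", "keylogger.dll", "WriteProcessMemory"],
    ["WriteProcessMemory", "cmd.exe /c del", "keylogger.dll"],
]
benign_samples = [
    ["GetOpenFileNameW", "Copyright Example Corp", "PrintDlgExW"],
    ["PrintDlgExW", "Help and Support", "GetOpenFileNameW"],
]

# Assumed vector size; a real deployment would tune this value.
hasher = FeatureHasher(n_features=2000, input_type="dict")


def to_vector(strings):
    """Hash a list of strings into one fixed-length feature vector."""
    features = {s: 1 for s in strings}
    return hasher.transform([features]).toarray()[0]


# Feature matrix plus label vector: 1 = malware, 0 = benign ware.
X = np.array([to_vector(s) for s in malware_samples + benign_samples])
y = np.array([1] * len(malware_samples) + [0] * len(benign_samples))

classifier = RandomForestClassifier(n_estimators=64, random_state=0)
classifier.fit(X, y)

# Score a sample the model has never seen.
unseen = to_vector(["keylogger.dll", "WriteProcessMemory", "UnknownString"])
probability = classifier.predict_proba([unseen])[0, 1]  # column 1 = class 1 (malware)
print("probability of maliciousness: %.2f" % probability)
```

With only four toy samples the printed probability means little; the point is the order of the steps: extract strings, hash, train with labels, then predict.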
(Don't worry too much about the terminology above; I will explain it as I walk you through the code line by line.)
Implementation and Execution
We import the first three prerequisite Python libraries:
• re (regular expressions);
• numpy;
• the FeatureHasher class (performs string hashing).
The definition of the function get_string_features() is shown in the following figures. It takes an absolute file path as its first argument and an instance of the FeatureHasher class as its second argument.
The "front end" of this function opens the PE binary file specified by the caller and uses a regular expression to perform text matching on that file, returning all matched strings in a list (the variable strings).
For example, if we extract strings from a malware binary using the code snippet above, the findall() method returns a list containing all candidate strings:
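Here is a minimal Python 3 sketch of this string-extraction step, using only the standard library. The names extract_strings and extract_strings_from_file, and the sample bytes, are hypothetical. I also assume that "longer than five characters" means a minimum run of six printable ASCII characters; adjust min_length if you read the threshold differently.

```python
import re


def extract_strings(data, min_length=6):
    """Return every run of printable ASCII characters of at least min_length bytes."""
    # [ -~] covers the printable ASCII range, from space (0x20) to tilde (0x7e).
    pattern = re.compile(rb"[ -~]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


def extract_strings_from_file(path, min_length=6):
    """Read a binary file in 'rb' mode and extract its printable strings."""
    with open(path, "rb") as file_object:
        return extract_strings(file_object.read(), min_length)


# Hypothetical bytes standing in for the contents of a PE file.
sample = (
    b"MZ\x90\x00\x03This program cannot be run in DOS mode."
    b"\x00\x01abc\x00kernel32.dll\x00"
)
strings = extract_strings(sample)
print(strings)
# ['This program cannot be run in DOS mode.', 'kernel32.dll']
```

Short runs such as "MZ" and "abc" are dropped because they fall below the minimum length, which filters out most accidental matches in binary data.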
The "back end" of this function iterates over this list of strings, using every string as a key and 1 as its corresponding value to build a feature dictionary, indicating that the string exists within this binary. It then uses the transform() method of the FeatureHasher class to compress this dictionary. After that, it converts the resulting sparse matrix to a dense one, converts that to a standard numpy array, and returns the first element to the caller:
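The steps of this "back end" can be sketched in Python 3 as follows. This is my reconstruction from the description, not a copy of the original figure: the function name hash_string_features, the three sample strings, and n_features=2000 are assumptions.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher


def hash_string_features(strings, hasher):
    """Turn a list of strings into one fixed-length numpy feature vector."""
    # Each string becomes a key with value 1: "this string exists in the binary".
    string_features = {s: 1 for s in strings}

    # transform() expects an iterable of samples, hence the one-element list.
    hashed_features = hasher.transform([string_features])

    # Sparse matrix -> dense matrix -> plain numpy array -> first (only) row.
    hashed_features = hashed_features.todense()
    hashed_features = np.asarray(hashed_features)
    return hashed_features[0]


# input_type="dict" is FeatureHasher's default; n_features is an assumed size.
hasher = FeatureHasher(n_features=2000, input_type="dict")
strings = ["This program cannot be run in DOS mode.", "kernel32.dll", "LoadLibraryA"]
feature_vector = hash_string_features(strings, hasher)
print(feature_vector.shape)
```

Note that FeatureHasher uses alternating signs by default, so the non-zero entries of the vector can be -1 as well as 1.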
To make this point clearer, I ran an experiment to show you the inner workings of that code chunk:
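An experiment along these lines might look like the sketch below. I assume a deliberately tiny n_features=16 so the printed output stays readable, and the string lists are made up. It shows that the hashed vector keeps the same length whether it encodes three strings or a thousand.

```python
from sklearn.feature_extraction import FeatureHasher

# Deliberately tiny vector size (an assumption) so the output is easy to read.
hasher = FeatureHasher(n_features=16)

few_strings = ["kernel32.dll", "LoadLibraryA", "GetProcAddress"]
many_strings = ["hypothetical_string_%d" % i for i in range(1000)]

# transform() returns a sparse matrix with one row per sample.
sparse_row = hasher.transform([{s: 1 for s in few_strings}])
print(sparse_row.shape)
print(sparse_row)  # only the occupied columns are stored

# Densify and take the single row, as get_string_features() does.
short_vector = sparse_row.toarray()[0]
long_vector = hasher.transform([{s: 1 for s in many_strings}]).toarray()[0]

print(len(few_strings), "strings ->", short_vector.shape)
print(len(many_strings), "strings ->", long_vector.shape)
```

The price of this fixed length is hash collisions: with only 16 slots, many of the 1000 strings share a column, which is why a real feature vector needs far more slots.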
As you can see from the above figure, compared to the original list we used to store raw strin