(Full English Version) Random Forest Classifier On Malware (Part 1)
2023-08-26 21:10:40

Random Forest Classifier On Malware

(copyright 2020 by YI SHA, if you want to re-post this,please send me an email:shayi1983end@gmail.com)

Overview


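The random forest classifier has recently become a popular machine learning algorithm for identifying malware, and here it is implemented in the Python programming language. The traditional signature-based and heuristic detection used by anti-virus software can no longer fully detect the large number of variants, so an efficient and accurate method is needed. Fortunately, we can rely on the open-source sklearn library.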

In this tutorial, I'll show you how to use the random forest classifier, a machine learning algorithm, to detect malware with the Python programming language.

The traditional signature-based and heuristic approaches used by most anti-virus software are no longer adequate for detecting the huge number of malware variants emerging today. To cope with billions of variants, we need a fast, automatic, and accurate way to judge whether an unknown software binary is malicious or benign.


The Python sklearn library provides a RandomForestClassifier class that does this job very well. Note that the simplest way to use the random forest algorithm is in a binary (two-class) scenario: classifying an unknown object into one of two possible categories. This means that any binary classification task, not merely telling malware from benign software, can take advantage of a random forest classifier.


So let's get to the topic. From a high-level perspective, I'll first extract every printable string longer than five characters from each of the two training datasets, malware and benign software. Next, I compress this data using the hashing trick to save memory and speed up analysis. Then I use the data, along with a label vector, to train our random forest classifier, so that the model forms a general notion of what malware and benign software look like. Finally, I pass a previously unseen Windows PE binary file to the classifier and let it make a prediction. The resulting value is the probability that the file is malicious, which can be fed to the other components of an anti-virus product.

(Don't worry too much about the terminology above; I will explain it as I walk you through the code line by line.)
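Before going into the details, the overall flow described above can be sketched in a few lines. This is only a minimal illustration, not the author's code: the string lists are hypothetical stand-ins for strings extracted from real binaries, and the values chosen for `n_features`, `n_estimators`, and `random_state` are assumptions made for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

# Hypothetical string sets standing in for strings extracted from real binaries.
malware_samples = [
    ["CreateRemoteThread", "WriteProcessMemory", "keylogger.dll"],
    ["WriteProcessMemory", "cmd.exe /c del", "keylogger.dll"],
]
benign_samples = [
    ["Copyright Microsoft", "GetOpenFileName", "About this program"],
    ["GetOpenFileName", "PrintDlgEx", "Copyright Microsoft"],
]

# Hashing trick: map each {string: 1} dictionary to a fixed-width vector.
hasher = FeatureHasher(n_features=2000)  # assumed vector size
samples = malware_samples + benign_samples
X = hasher.transform([{s: 1 for s in strings} for strings in samples]).toarray()

# Label vector: 1 = malware, 0 = benign.
y = [1] * len(malware_samples) + [0] * len(benign_samples)

classifier = RandomForestClassifier(n_estimators=64, random_state=0)
classifier.fit(X, y)

# Score a previously unseen sample; column 1 of predict_proba is class 1.
unseen = ["keylogger.dll", "CreateRemoteThread", "some other string"]
unseen_features = hasher.transform([{s: 1 for s in unseen}]).toarray()
probability = classifier.predict_proba(unseen_features)[0, 1]
print(f"probability of maliciousness: {probability:.2f}")
```

With only four toy samples the printed probability is not meaningful. The point is the shape of the pipeline: extract strings, hash them, fit the classifier, then call `predict_proba()`.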


Implementation and Execution

We import the first three prerequisite Python libraries:

• re (regular expressions);

• numpy;

• the FeatureHasher class (performs string hashing).



The definition of the function get_string_features() is shown in the following figures. It takes the absolute path of a file as its first argument and an instance of the FeatureHasher class as its second argument.

The "front-end" of this function opens the PE binary file specified by the caller and uses a regular expression to perform text matching on the file's contents, returning all matched strings in a list (the variable strings).


For example, if we extract strings from a malware binary using the above code snippet, the findall() method returns a list containing all candidate strings:
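The original figures are not reproduced here. As a minimal sketch of the extraction step, assuming the pattern matches runs of printable ASCII characters (space through tilde), something like the following would work. The helper name `extract_strings`, the `MIN_LENGTH` threshold, and the sample bytes are all hypothetical and chosen for illustration.

```python
import re

# Assumed threshold: "longer than five characters" means at least six.
MIN_LENGTH = 6
# Assumed pattern: runs of printable ASCII, from space (0x20) to tilde (0x7e).
PRINTABLE_RUN = re.compile(rb"[ -~]{%d,}" % MIN_LENGTH)


def extract_strings(data):
    """Return every printable-ASCII run found in raw binary data."""
    return [match.decode("ascii") for match in PRINTABLE_RUN.findall(data)]


# Hypothetical bytes standing in for the contents of a PE file.
sample = b"MZ\x90\x00\x03kernel32.dll\x00\x00abc\x00CreateRemoteThread\x00\xff\xfe"
strings = extract_strings(sample)
print(strings)  # ['kernel32.dll', 'CreateRemoteThread']
```

Short runs such as "MZ" and "abc" are dropped because they fall below the length threshold. For a real file, the bytes would come from `open(path, "rb").read()`.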


The "back-end" of this function iterates over this list of strings, using each string as a key and 1 as its value to build a feature dictionary, which indicates that the string exists in the binary. It then uses the transform() method of the FeatureHasher class to compress this dictionary, converts the resulting sparse matrix to a dense one, converts that to a standard numpy array, and returns the first element to the caller:
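The hashing step can be sketched as follows. This is an illustration under assumptions rather than the original listing: the helper name `hash_string_features` and the `n_features=20000` setting are hypothetical, and the input strings are made up.

```python
import numpy
from sklearn.feature_extraction import FeatureHasher


def hash_string_features(strings, hasher):
    """Compress a list of strings into one fixed-width feature vector."""
    # Every string becomes a key with value 1, marking its presence.
    string_features = {string: 1 for string in strings}
    # transform() expects an iterable of samples, so wrap the dict in a list.
    hashed = hasher.transform([string_features])
    # Densify the 1 x n_features sparse matrix, convert it to a numpy
    # array, and keep its only row.
    return numpy.asarray(hashed.todense())[0]


hasher = FeatureHasher(n_features=20000)  # assumed vector size
strings = ["kernel32.dll", "CreateRemoteThread", "kernel32.dll"]
features = hash_string_features(strings, hasher)
print(features.shape)  # (20000,)
```

However many distinct strings a binary contains, the returned vector always has `n_features` entries, which is what keeps memory usage bounded. Duplicate strings collapse into a single dictionary key, so repetition inside one file does not change the result.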


To make this point clearer, I ran an experiment to show you the inner workings of that code chunk:



As you can see from the above figure, compared to the original list we used to store the raw strin
