Random Forest Classifier On Malware (All-English Edition) - Python


Random Forest Classifier On Malware (All-English Edition), Part 1

2023-08-26 21:10:40

Tags: Random Forest Classifier, Malware


Overview

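The random forest classifier is a machine learning algorithm that has recently become popular for identifying malware; here it is implemented in the Python programming language. The traditional signature-based and heuristic detection used by anti-virus software can no longer fully detect the large number of malware variants, so an efficient and accurate method is needed. Fortunately, we can make use of the open-source sklearn library.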

In this tutorial, I'll show you how to use the random forest classifier, a machine learning algorithm, to detect malware with the Python programming language.

The traditional, now obsolete, signature-based and heuristic approaches used by most anti-virus software are no longer suitable for detecting the huge number of malware variants emerging today. For these billions of variants, we need a fast, automatic, and accurate way to judge whether an unknown software binary is malicious or benign.

The Python sklearn library provides a RandomForestClassifier class that does this job very well. Note that the simplest way to use the random forest algorithm is in a binary (two-class) scenario: assigning an unknown object to one of two possible categories. This means that any task involving such a two-way decision, not merely telling malware from benign software, can take advantage of the random forest classifier.

So let's get into the topic. From a high-level perspective, I'll extract every printable string longer than five characters from the two training datasets, malware and benign software, respectively. I then compress this data with the hashing trick to save memory and speed up the analysis. Next, I use the data, along with a label vector, to train the random forest classifier, so that the model forms a general concept of what malware and benign software look like. Finally, I pass a previously unseen Windows PE binary file to the classifier and let it make a prediction; the resulting value is the probability that the file is malicious, which can be fed to the logic of other components inside an anti-virus product.

(Don't worry too much about the terminology above; I will explain it as I walk you through the code line by line.)
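To make the workflow above concrete before going into the details, here is a minimal sketch of the whole pipeline. It is not the code developed in this tutorial: the string lists are hypothetical toy data standing in for strings extracted from real binaries, and the helper name to_vector, the 64 hash buckets, and the 64 trees are assumptions chosen only to keep the example small.

```python
import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

# Assumption: a tiny feature space, just for illustration.
hasher = FeatureHasher(n_features=64)


def to_vector(strings):
    """Hash a list of strings into one fixed-length feature vector."""
    hashed = hasher.transform([{s: 1 for s in strings}])
    return numpy.asarray(hashed.todense())[0]


# Hypothetical toy samples: each list stands for the strings of one binary.
malware_samples = [
    ["CreateRemoteThread", "WriteProcessMemory", "keylogger.dll"],
    ["WriteProcessMemory", "VirtualAllocEx", "inject.exe"],
]
benign_samples = [
    ["GetOpenFileNameW", "PrintDlgW", "notepad.pdb"],
    ["GetSaveFileNameW", "PrintDlgW", "calc.pdb"],
]

# Feature matrix plus label vector: 1 = malware, 0 = benign.
X = numpy.array([to_vector(s) for s in malware_samples + benign_samples])
y = numpy.array([1] * len(malware_samples) + [0] * len(benign_samples))

classifier = RandomForestClassifier(n_estimators=64, random_state=0)
classifier.fit(X, y)

# Score a previously unseen (again hypothetical) sample.
unseen = to_vector(["WriteProcessMemory", "CreateRemoteThread", "dropper.exe"])
probability = classifier.predict_proba([unseen])[0, 1]
print("Probability of maliciousness:", probability)
```

predict_proba() returns one column per class in the order of classifier.classes_, so column 1 is the probability of the label 1 (malware) here. With only four toy samples the number itself means nothing; a real model needs many labeled binaries.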

Implementation and Execution

We import the first three prerequisite Python libraries:

• re (regular expressions);

• numpy;

• the FeatureHasher class (performs string hashing).
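In current versions of these libraries the three imports would presumably look like the lines below. FeatureHasher lives in the sklearn.feature_extraction module; the exact import style the author used is an assumption.

```python
import re      # regular expressions, used to find printable strings
import numpy   # array handling for the hashed feature vectors

# The FeatureHasher class implements the hashing trick for dict features.
from sklearn.feature_extraction import FeatureHasher
```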

The definition of the function get_string_features() is shown in the following figures. It takes an absolute file path as its first argument and an instance of the FeatureHasher class as its second argument.

The "front end" of this function opens the PE binary file specified by the caller and uses a regular expression to match text in that file, returning all matched strings in a list (the variable strings).

For example, if we extract strings from a malware binary using the above code snippet, the findall() method will return a list containing all candidate strings.
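The string-extraction step can be sketched as follows. This is an illustrative version and not necessarily the author's exact code: the helper names, the constant MIN_LENGTH, and the choice to read "longer than five characters" as six or more are assumptions, and the sample bytes are made up for the demonstration.

```python
import re

# Assumption: "longer than five characters" is read as six or more.
MIN_LENGTH = 6

# Runs of printable ASCII (space through tilde), at least MIN_LENGTH long.
STRING_PATTERN = re.compile(rb"[ -~]{%d,}" % MIN_LENGTH)


def extract_strings(data):
    """Return every printable-ASCII run found in raw binary data."""
    return [match.decode("ascii") for match in STRING_PATTERN.findall(data)]


def extract_strings_from_file(path):
    """Read a file in binary mode and extract its printable strings."""
    with open(path, "rb") as file_object:
        return extract_strings(file_object.read())


# Made-up bytes imitating a fragment of a PE file.
sample = b"\x00\x01MZ\x90\x00kernel32.dll\x00\xffabc\x00LoadLibraryA\x00"
print(extract_strings(sample))  # ['kernel32.dll', 'LoadLibraryA']
```

Short runs such as "MZ" and "abc" fall below the length threshold and are dropped, which filters out most accidental matches in binary data.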

The "back end" of this function iterates over the strings list, using every string as a key and 1 as its corresponding value to build a feature dictionary, which records that the string exists in this binary. It then uses the transform() method of the FeatureHasher class to compress this dictionary. After that, it converts the resulting sparse matrix to a dense one, turns it into a standard numpy array, and returns the first element to the caller.
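The back-end steps just described can be sketched like this. The function name hash_string_features and the value n_features=20000 are assumptions made for illustration; the article only states that a FeatureHasher instance is passed in.

```python
import numpy
from sklearn.feature_extraction import FeatureHasher


def hash_string_features(strings, hasher):
    """Turn a list of strings into one dense, fixed-length feature vector."""
    # Every string becomes a key with value 1: "this string is present".
    string_features = {string: 1 for string in strings}

    # transform() expects a sequence of samples and returns a sparse
    # matrix with one row per sample, so we wrap the dict in a list.
    hashed_features = hasher.transform([string_features])

    # Sparse matrix -> dense matrix -> plain numpy array, then take
    # the first (and only) row.
    dense_features = numpy.asarray(hashed_features.todense())
    return dense_features[0]


hasher = FeatureHasher(n_features=20000)  # assumed vector length
vector = hash_string_features(["kernel32.dll", "LoadLibraryA"], hasher)
print(vector.shape)  # (20000,)
```

Whatever the number of strings in a binary, the returned vector always has n_features entries, which is what lets binaries of very different sizes share one feature matrix.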

To make this point clearer, I ran a small experiment to show you the inner workings of that code chunk:
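As a separate, stand-in illustration of the idea behind transform(), the hashing trick itself can be sketched with the standard library only. This is not FeatureHasher's implementation, which uses a signed MurmurHash3; the use of md5, the function name hashing_trick, and the 16 buckets are assumptions chosen to keep the example self-contained.

```python
import hashlib


def hashing_trick(features, n_buckets):
    """Fold a {string: value} dict into a fixed-length list of counts."""
    vector = [0] * n_buckets
    for key, value in features.items():
        # Hash the key and map it to one of n_buckets positions.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_buckets
        # Colliding keys simply add up in the same bucket.
        vector[index] += value
    return vector


features = {"kernel32.dll": 1, "LoadLibraryA": 1, "GetProcAddress": 1}
vector = hashing_trick(features, n_buckets=16)
print(len(vector), sum(vector))  # 16 3
```

The vector length depends only on n_buckets, never on how many distinct strings exist, so memory use stays bounded. The price is that two different strings can land in the same bucket and become indistinguishable.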

As you can see from the above figure, compared to the original list we used to store the raw strin
