Digging into IP Flow Records with a Visual Kernel Method - Computational Intelligence in Security for Information Systems

Machine-learning-based approaches train classification algorithms that detect
new malware, by means of datasets composed of several characteristic features of
both malicious and benign software. Schultz et al. [3] were the first to introduce
the concept of applying machine-learning models to the detection of malware
based on their respective binary codes. Specifically, they applied several
classifiers to three different feature sets: (i) program headers, (ii) strings
and (iii) byte sequences.

Later, Kolter et al. [4] improved Schultz's results by applying n-grams (i.e.,
overlapping byte sequences) instead of non-overlapping sequences. This approach
employed several algorithms, achieving the best results with a boosted decision
tree. Likewise, substantial research has focused on n-gram distributions of byte
sequences and data-mining [5,6].
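
The difference between overlapping n-grams and non-overlapping sequences can be
illustrated with a minimal sketch. The function name `byte_ngrams` and the
sample bytes are hypothetical and are not taken from the cited works.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int, overlapping: bool = True) -> Counter:
    """Count the n-byte sequences found in `data`.

    overlapping=True slides the window one byte at a time (n-grams);
    overlapping=False advances n bytes at a time (non-overlapping chunks).
    """
    step = 1 if overlapping else n
    return Counter(data[i:i + n] for i in range(0, len(data) - n + 1, step))


# Hypothetical 6-byte fragment, used only for illustration.
sample = bytes.fromhex("558bec83ec08")
overlapping_counts = byte_ngrams(sample, 2)
chunk_counts = byte_ngrams(sample, 2, overlapping=False)
```

With the overlapping window, the sequence `8b ec` is counted, whereas the
non-overlapping split never sees it because it straddles two chunks.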

Additionally, opcode sequences have recently been introduced as an alternative
to byte n-grams [7]. This approach appears to be theoretically better than
byte n-grams because it relies on source code rather than the bytes of a binary
file, which can be changed more easily than code [8].
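
As a rough illustration of an opcode-sequence representation, the sketch below
counts sequences of n consecutive opcodes and normalises them to relative
frequencies. It assumes the executable has already been disassembled into a
list of mnemonics; the weighting scheme, the function name
`opcode_ngram_frequencies` and the sample listing are assumptions, not the
method described in [7].

```python
from collections import Counter


def opcode_ngram_frequencies(opcodes, n=2):
    """Map each sequence of n consecutive opcodes to its relative frequency."""
    grams = [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


# Hypothetical disassembly output, used only for illustration.
listing = ["push", "mov", "sub", "mov", "sub", "ret"]
features = opcode_ngram_frequencies(listing, n=2)
```

Each key of `features` can then serve as one dimension of a feature vector.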

However, these supervised machine-learning classifiers require a high number
of labelled executables for each of the classes. Sometimes, we can omit one class
for labelling, such as in anomaly detection for intrusion detection [9]. It is
quite difficult to obtain this amount of labelled data for a real-world problem
such as malicious code analysis. To gather these data, a time-consuming process
of analysis is mandatory, and in the process, some malicious executables are
able to evade detection.

Semi-supervised learning is a type of machine-learning technique that is
especially useful when only a limited amount of labelled data exists for each
file class. These techniques generate a supervised classifier based on labelled
data and predict the label for every unlabelled instance. The instances whose
predicted classes surpass a certain confidence threshold are added to the
labelled dataset. The process is repeated until certain conditions are satisfied
(a commonly used criterion is the maximum likelihood found by the
expectation-maximisation technique). These approaches enhance the accuracy of
fully unsupervised methods (i.e., no labels within the dataset) [10].
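
The loop described above can be sketched as follows. This is a minimal
illustration under stated assumptions: the base classifier is a simple
nearest-centroid model, the confidence is a softmax over negative distances,
and the loop stops when no instance passes the threshold (rather than using the
expectation-maximisation criterion mentioned in the text). All names and
values are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def predict(centroids, x):
    """Return (label, confidence) from a softmax over negative distances."""
    scores = {label: math.exp(-math.dist(c, x)) for label, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def self_train(labelled, unlabelled, threshold=0.8, max_rounds=10):
    """Repeatedly move confidently predicted instances into the labelled set."""
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        by_class = {}
        for x, y in labelled:
            by_class.setdefault(y, []).append(x)
        centroids = {y: centroid(xs) for y, xs in by_class.items()}
        confident, remaining = [], []
        for x in pool:
            label, confidence = predict(centroids, x)
            if confidence >= threshold:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:  # assumed stopping rule: nothing new to add
            break
        labelled.extend(confident)
        pool = remaining
    return labelled, pool


# Hypothetical two-dimensional feature vectors.
seed = [([0.0, 0.0], "benign"), ([10.0, 10.0], "malware")]
final, undecided = self_train(seed, [[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]])
```

In this toy run, the two instances close to a seed are absorbed into the
labelled set, while the instance equidistant from both classes never reaches
the threshold and stays unlabelled.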

Given this background, we propose here an approach that employs a
semi-supervised learning technique for the detection of unknown malware. In
particular, we utilise the method Learning with Local and Global Consistency
(LLGC) [11], which is able to learn from both labelled and unlabelled data and
capable of providing a smooth solution with respect to the intrinsic structure
displayed by both labelled and unlabelled instances. For the representation of
executables, we adopt opcode sequences [7] and apply LLGC to them for the
detection of unknown malware. However, the presented semi-supervised methodology
is applicable to any representation that can be expressed as a feature vector.
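
To give an idea of how a method of this kind operates on feature vectors, the
sketch below implements the iterative label-spreading scheme usually associated
with LLGC: a Gaussian affinity matrix with a zero diagonal is normalised
symmetrically, and label scores are propagated until they stabilise. The text
above does not give the formulation, so the affinity function, the values of
`alpha`, `sigma` and `iterations`, and the toy data are all assumptions.

```python
import math


def llgc(points, labels, n_classes, alpha=0.9, sigma=1.0, iterations=200):
    """Spread labels over a similarity graph; labels[i] is a class index or None."""
    n = len(points)
    # Gaussian affinity between every pair of points, with a zero diagonal.
    W = [[0.0 if i == j else
          math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    # Symmetric normalisation S = D^(-1/2) W D^(-1/2); assumes every degree > 0.
    d = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    # Initial label matrix: one-hot rows for labelled points, zeros otherwise.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iterations):
        # F <- alpha * S * F + (1 - alpha) * Y
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(n_classes)] for i in range(n)]
    # Each point takes the class with the highest propagated score.
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]


# Hypothetical feature vectors: two clusters with one labelled point each.
vectors = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [5.0, 0.0], [5.5, 0.0], [6.0, 0.0]]
known = [0, None, None, None, None, 1]
predicted = llgc(vectors, known, n_classes=2)
```

Because the propagation only needs pairwise distances between vectors, the same
code would apply to any representation that yields a feature vector.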

Summarising, our main findings in this paper are: (i) we describe how to adopt
LLGC for opcode-sequence-based unknown malware detection, (ii) we empirically
determine the optimal number of labelled instances and evaluate how this
parameter affects the final accuracy of the models and (iii) we demonstrate
