Machine-learning-based approaches train classification algorithms that detect
new malware, by means of datasets composed of several characteristic features of
both malicious and benign software. Schultz et al. [3] were the first to introduce
the concept of applying machine-learning models to the detection of malware
based on their respective binary codes. Specifically, they applied several classi-
fiers to three different feature sets: (i) program headers, (ii) strings and (iii) byte
sequences.
Later, Kolter et al. [4] improved on Schultz's results by applying n-grams (i.e.,
overlapping byte sequences) instead of non-overlapping sequences. This approach
employed several algorithms, achieving the best results with a boosted decision
tree. Likewise, substantial research has focused on n-gram distributions of byte
sequences and data-mining [5,6].
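The difference between overlapping n-grams and non-overlapping byte sequences can be illustrated with a minimal sketch. The function name `byte_ngrams`, its parameters and the sample bytes are hypothetical and chosen only for illustration; they are not taken from [3] or [4].

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int, overlapping: bool = True) -> Counter:
    """Count the n-byte sequences that occur in `data`.

    overlapping=True slides the window one byte at a time (n-grams);
    overlapping=False advances by n bytes (non-overlapping sequences).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    step = 1 if overlapping else n
    return Counter(data[i:i + n] for i in range(0, len(data) - n + 1, step))


# Hypothetical 6-byte fragment, used only to show the two windowing modes.
sample = bytes.fromhex("4d5a90000300")
overlap = byte_ngrams(sample, 2)                      # 5 windows
disjoint = byte_ngrams(sample, 2, overlapping=False)  # 3 windows
```

The overlapping variant also captures sequences that straddle a window boundary, such as the bytes `5a 90` in the sample, which the non-overlapping variant never sees.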
Additionally, opcode sequences have recently been introduced as an alternative
to byte n-grams [7]. This approach appears to be theoretically better than byte
n-grams because it relies on source code rather than on the bytes of a binary
file, which can be changed more easily than code [8].
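As a rough illustration of an opcode-sequence representation, the sketch below turns a list of mnemonics into relative frequencies of length-n opcode sequences and then into a fixed-order feature vector. It assumes the opcodes have already been extracted by a disassembler; the function names, the sample listing and the plain relative-frequency weighting are assumptions made for this example and are not necessarily the weighting used in [7].

```python
from collections import Counter


def opcode_sequence_frequencies(opcodes, n=2):
    """Relative frequency of each length-n opcode sequence in `opcodes`."""
    grams = [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def to_vector(freqs, vocabulary):
    """Project the frequencies onto a fixed, ordered vocabulary of sequences."""
    return [freqs.get(gram, 0.0) for gram in vocabulary]


# Hypothetical disassembly output (mnemonics only, operands discarded).
listing = ["push", "mov", "add", "mov", "add", "ret"]
freqs = opcode_sequence_frequencies(listing, n=2)
vocab = sorted(freqs)  # in practice the vocabulary would span the whole corpus
vector = to_vector(freqs, vocab)
```

Any classifier that accepts numeric feature vectors can then be trained on vectors built over a shared vocabulary.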
However, these supervised machine-learning classifiers require a large number
of labelled executables for each of the classes. Sometimes one class can be left
unlabelled, as in anomaly detection for intrusion detection [9]. It is quite
difficult to obtain this amount of labelled data for a real-world problem such
as malicious code analysis. Gathering these data requires a time-consuming
analysis process, during which some malicious executables are able to evade
detection.
Semi-supervised learning is a type of machine-learning technique that is
especially useful when only a limited amount of labelled data exists for each
file class. These techniques generate a supervised classifier from the labelled
data and use it to predict the label of every unlabelled instance. Instances
whose predicted class exceeds a certain confidence threshold are added to the
labelled dataset. The process is repeated until certain conditions are satisfied
(a commonly used criterion is the maximum likelihood found by the
expectation-maximisation technique). These approaches improve on the accuracy of
fully unsupervised methods (i.e., no labels within the dataset) [10].
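The iterative procedure described above can be sketched as a simple self-training loop. This is a minimal illustration under stated assumptions: the base classifier is a nearest-centroid rule, confidence is a softmax over negative distances, and the loop stops when no instance passes the threshold or a round limit is reached. That stopping rule is a simplification and not the expectation-maximisation criterion mentioned in the text. All names and data points are hypothetical.

```python
import math


def centroids(X, y):
    """Mean feature vector of each class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict_with_confidence(cents, x):
    """Return (label, confidence) for x; confidence lies in (0, 1]."""
    weights = {c: math.exp(-math.dist(x, m)) for c, m in cents.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())


def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Grow the labelled set with confident predictions, then refit."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        cents = centroids(X_lab, y_lab)  # supervised model for this round
        remaining, added = [], 0
        for x in pool:
            label, conf = predict_with_confidence(cents, x)
            if conf >= threshold:
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0 or not pool:
            break
    return centroids(X_lab, y_lab), pool


# Hypothetical two-dimensional feature vectors.
model, undecided = self_train(
    [[0.0, 0.0], [10.0, 10.0]], ["benign", "malware"],
    [[1.0, 0.0], [9.0, 10.0], [5.0, 5.0]],
)
```

In this toy run the two points near a labelled example are absorbed in the first round, while the point equidistant from both classes never reaches the threshold and stays in the unlabelled pool.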
Given this background, we propose here an approach that employs a
semi-supervised learning technique for the detection of unknown malware. In
particular, we utilise the Learning with Local and Global Consistency (LLGC)
method [11], which learns from both labelled and unlabelled data and provides a
smooth solution with respect to the intrinsic structure displayed by both
labelled and unlabelled instances. We represent executables by their opcode
sequences [7] and adopt LLGC to detect unknown malware from that representation.
However, the presented semi-supervised methodology extends to any
representation that can be expressed as a feature vector.
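For readers unfamiliar with LLGC, the following minimal sketch shows the iterative form in which the method is usually presented: a Gaussian affinity matrix W with zero diagonal, the normalisation S = D^{-1/2} W D^{-1/2}, and the update F(t+1) = αSF(t) + (1 - α)Y, after which each instance takes the class with the largest score. The function name, default parameter values and toy data are assumptions made for illustration and are not the configuration used in this paper.

```python
import math


def llgc(X, labels, n_classes, alpha=0.99, sigma=1.0, iterations=200):
    """Label propagation in the style of LLGC.

    labels[i] is a class index for labelled instances and None otherwise.
    Assumes at least two instances and non-zero row sums in W.
    """
    n = len(X)
    # Gaussian affinity matrix with a zero diagonal.
    W = [[0.0 if i == j
          else math.exp(-math.dist(X[i], X[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}.
    d = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
         for i in range(n)]
    # Y holds the initial labels; unlabelled rows are all zeros.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iterations):
        # Spread scores to neighbours while retaining the initial labels.
        F = [[alpha * sum(S[i][j] * F[j][c] for j in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(n_classes)] for i in range(n)]
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]


# Hypothetical feature vectors: two clusters with one labelled point each.
X = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [5.0, 0.0], [5.5, 0.0], [6.0, 0.0]]
predicted = llgc(X, [0, None, None, 1, None, None], n_classes=2)
```

Because scores flow mainly along strong affinities, each unlabelled point in the toy data inherits the label of the single labelled point in its own cluster.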
Summarising, our main findings in this paper are: (i) we describe how to adopt
LLGC for opcode-sequence-based unknown malware detection, (ii) we empirically
determine the optimal number of labelled instances and evaluate how this
parameter affects the final accuracy of the models and (iii) we demonstrate
 