Digging into IP Flow Records with a Visual Kernel Method - Computational Intelligence in Security for Information Systems

Opcode-Sequence-Based Semi-supervised Unknown Malware Detection

Igor Santos, Borja Sanz, Carlos Laorden, Felix Brezo, and Pablo G. Bringas

S3Lab, DeustoTech - Computing, Deusto Institute of Technology

University of Deusto,

Avenida de las Universidades 24, 48007

Bilbao, Spain

{isantos,borja.sanz,claorden,felix.brezo,pablo.garcia.bringas}@deusto.es

Abstract. Malware is any computer software potentially harmful to both computers and networks. The amount of malware is growing every year and poses a serious global security threat. Signature-based detection is the most widespread method in commercial antivirus software; however, it consistently fails to detect new malware. Supervised machine learning has been adopted to solve this issue, but its usefulness is far from complete because it requires a large number of malicious executables and benign software to be identified and labelled beforehand. In this paper, we propose a new method of malware detection that adopts a well-known semi-supervised learning approach to detect unknown malware. This method is based on examining the frequencies of appearance of opcode sequences to build a semi-supervised machine-learning classifier using a set of labelled (either malware or legitimate software) and unlabelled instances. We performed an empirical validation demonstrating that the labelling effort is lower than when supervised learning is used, while the system maintains a high accuracy rate.

Keywords: malware detection, machine learning, semi-supervised learning.

1 Introduction

Malware is defined as any computer software explicitly designed to damage computers or networks. While in the past malware writers sought 'fame and glory', their motivation has since evolved towards economic gain [1].

Commercial anti-malware software is highly dependent on a signature database [2]. A signature is a unique sequence of bytes that is always present within malicious executables and in the files already infected. The main issue with this approach is that malware analysts must wait until new malware has harmed several computers before they can generate a signature file and provide a solution. Analysed suspect files are compared with this list of signatures. When the signatures match, the file being tested is classified as malware. Although this approach has proven effective when threats are known beforehand, signature-based methods are overwhelmed by large amounts of new malware.
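The signature check described above can be illustrated with a minimal Python sketch. It is not taken from any real antivirus product: the signature names and byte values in `SIGNATURES` are made up for illustration, and the helper names `scan` and `is_malware` are hypothetical. The sketch only shows why an exact byte-sequence match catches a known sample but misses a variant that differs by a single byte.

```python
# Hypothetical signature database: name -> byte sequence.
# All names and byte values here are invented for illustration.
SIGNATURES = {
    "Example.A": bytes.fromhex("deadbeef4d5a"),
    "Example.B": b"\x90\x90\xeb\xfe",
}


def scan(data: bytes, signatures=SIGNATURES):
    """Return the sorted names of all signatures found in data."""
    return sorted(name for name, sig in signatures.items() if sig in data)


def is_malware(data: bytes, signatures=SIGNATURES) -> bool:
    """Classify data as malware if at least one signature matches."""
    return bool(scan(data, signatures))


# A sample containing a known signature, and a variant with one byte changed.
known = b"\x00\x01" + bytes.fromhex("deadbeef4d5a") + b"\x02"
variant = b"\x00\x01" + bytes.fromhex("deadbeef4d5b") + b"\x02"

print(scan(known), is_malware(known))
print(scan(variant), is_malware(variant))
```

Under these assumptions the known sample is flagged while the one-byte variant is not, which is the limitation that motivates learning-based detection of unknown malware.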
