that labelling efforts can be reduced in the malware detection industry, while
still maintaining a high rate of accuracy in the task.
2 Opcode-Sequence Features for Malware Detection
To represent executables using opcodes, we extract the opcode sequences and
their frequency of appearance. Specifically, we define a program $\rho$ as a set
of ordered opcodes $o$, $\rho = (o_1, o_2, o_3, o_4, \ldots, o_{\ell-1}, o_\ell)$,
where $\ell$ is the number of instructions $I$ of the program $\rho$. An opcode
sequence $os$ is defined as a subset of opcodes within the executable file, where
$os \subseteq \rho$; it is made up of opcodes $o$,
$os = (o_1, o_2, o_3, \ldots, o_{m-1}, o_m)$, where $m$ is the length of the
sequence of opcodes $os$. Consider an example code formed by the opcodes mov,
add, push and add; the following sequences of length 2 can be generated:
$s_1 = (\text{mov}, \text{add})$, $s_2 = (\text{add}, \text{push})$ and
$s_3 = (\text{push}, \text{add})$.
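
The example above can be reproduced with a short sliding-window sketch. It assumes that opcode sequences are contiguous windows over the ordered opcode list, which is what the mov, add, push, add example suggests; the function name `opcode_sequences` is hypothetical and not taken from the paper.

```python
def opcode_sequences(opcodes, m):
    """Return every contiguous opcode sequence of length m, in program order.

    Assumption: sequences are contiguous windows over the opcode list.
    """
    if m <= 0:
        raise ValueError("sequence length m must be positive")
    # A program shorter than m yields no sequences (empty range).
    return [tuple(opcodes[i:i + m]) for i in range(len(opcodes) - m + 1)]


program = ["mov", "add", "push", "add"]
sequences = opcode_sequences(program, 2)
print(sequences)
# [('mov', 'add'), ('add', 'push'), ('push', 'add')]
```

Tuples are used so that each sequence can later serve as a dictionary key when counting occurrences.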
Afterwards, we compute the frequency of occurrence of each opcode sequence
within the file by using term frequency (tf) [12], which is a weight widely used
in information retrieval:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of times the sequence $s_{i,j}$ (in our case, an
opcode sequence) appears in an executable $e$, and $\sum_k n_{k,j}$ is the total
number of terms in the executable $e$ (in our case, the total number of possible
opcode sequences).
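
As an illustration of the term-frequency step, the following sketch counts each opcode sequence in a single executable and divides by the total number of sequence occurrences in that file. Treating the denominator as the count of all extracted windows is an assumption about how "total number of terms" is meant; `term_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def term_frequencies(opcodes, m):
    """Map each length-m opcode sequence to n_ij / sum_k n_kj for one file.

    Assumption: the denominator is the number of sequence occurrences
    (windows) extracted from this executable.
    """
    windows = [tuple(opcodes[i:i + m]) for i in range(len(opcodes) - m + 1)]
    counts = Counter(windows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seq: n / total for seq, n in counts.items()}


tf = term_frequencies(["mov", "add", "push", "add", "push", "add"], 2)
for seq, value in tf.items():
    print(seq, value)
```

With this input there are five windows: (mov, add) occurs once, while (add, push) and (push, add) occur twice each, so the frequencies sum to 1.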
We define the Weighted Term Frequency (WTF) as the result of weighting
the relevance of each opcode when calculating the term frequency. To calculate
the relevance of each individual opcode, we collected malware from the
VxHeavens website¹ to assemble a malware dataset of 13,189 malware executables,
and we collected 13,000 executables from our computers. Using this dataset,
we disassembled each executable and computed the mutual information gain for
each opcode and the class:

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\left(\frac{p(x,y)}{p(x) \cdot p(y)}\right)$$

where $X$ is the opcode frequency and $Y$ is the class of the file (i.e.,
malware or benign software), $p(x,y)$ is the joint probability distribution
function of $X$ and $Y$, and $p(x)$ and $p(y)$ are the marginal probability
distribution functions of $X$ and $Y$. In our particular case, we defined the
two variables as the single opcode and whether or not the instance was malware.
Note that this weight only measures the relevance of a single opcode and not
the relevance of an opcode sequence.
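
The mutual information formula can be illustrated with an empirical estimate over discrete observations. This is only a sketch under stated assumptions: the opcode variable is discretised (here, 1 if the opcode appears in a file and 0 otherwise), the logarithm is taken in base 2 (the text does not state the base), and the name `mutual_information` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits for two equally long lists of discrete values."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts for p(x, y)
    px = Counter(xs)              # counts for p(x)
    py = Counter(ys)              # counts for p(y)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Toy data (hypothetical): one entry per executable.
labels = ["malware", "malware", "benign", "benign"]
opcode_present = [1, 1, 0, 0]    # perfectly aligned with the class
opcode_unrelated = [1, 0, 1, 0]  # independent of the class

print(mutual_information(opcode_present, labels))    # 1.0
print(mutual_information(opcode_unrelated, labels))  # 0.0
```

Only value pairs that actually occur contribute to the sum, which avoids evaluating the logarithm at zero probability.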
Using these weights, we computed the WTF as the product of sequence
frequencies and the previously calculated weight of every opcode in the sequence:

$$wtf_{i,j} = tf_{i,j} \cdot \prod_{o_z \in S} \frac{weight(o_z)}{100}$$

where $weight(o_z)$ is the calculated weight, by means of mutual information
gain, for the opcode $o_z$ of the sequence $S$, and $tf_{i,j}$ is the sequence
frequency measure for the given opcode sequence. We obtain a vector $\vec{v}$
composed of weighted opcode-sequence frequencies,
$\vec{v} = ((os_1, wtf_1), \ldots, (os_n, wtf_n))$, where $os_i$ is the opcode
sequence and $wtf_i$ is the weighted term frequency for that particular opcode
sequence.
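
A minimal sketch of the weighting step follows. It reads the formula as multiplying the term frequency by weight/100 for every opcode in the sequence, and it assumes the per-opcode weights are expressed on a 0 to 100 scale; the weight values, the term frequencies and the function name `weighted_term_frequency` are hypothetical and chosen only for illustration.

```python
from math import prod


def weighted_term_frequency(tf, weights):
    """Return wtf = tf * prod(weight(o_z) / 100) for each opcode sequence.

    Assumption: each opcode weight is divided by 100 before multiplying.
    """
    return {
        seq: freq * prod(weights[opcode] / 100 for opcode in seq)
        for seq, freq in tf.items()
    }


# Hypothetical term frequencies and opcode weights (not from the paper).
tf = {("mov", "add"): 0.5, ("add", "push"): 0.5}
weights = {"mov": 50.0, "add": 20.0, "push": 10.0}

wtf = weighted_term_frequency(tf, weights)

# Vector of (opcode sequence, weighted term frequency) pairs.
v = list(wtf.items())
print(v)
```

Here (mov, add) receives 0.5 · 0.5 · 0.2 and (add, push) receives 0.5 · 0.2 · 0.1, so sequences built from more relevant opcodes keep a larger share of their raw frequency.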
1 http://vx.netlux.org/
 