3 Adopting the Framework in the Security Domain
Organizing unstructured text so as to obtain the same information content
in a semantically structured fashion is a challenging research field with
applications in different contexts. One possible scenario is structuring
information through a semantic approach (with techniques for knowledge
extraction and representation) in order to develop a semantic search engine
that gives access to the semantic content of an unstructured document set.
Another interesting scenario is the development of a semantic interpretation
module that enables software or hardware agents to identify situations of
interest and enhances their cognitive abilities (for example, through the
recognition and learning of vocal and/or gestural commands). Moreover, a
semantic approach can be applied to unstructured information in order to
detect sensitive information and to enforce fine-grained access control over it.
These scenarios have motivated the proposal of the framework, which can be
instantiated in different contexts. What these contexts have in common is the
need to organize and transform heterogeneous documents into a structured form.
In order to contextualize the framework in a particular domain, a tuning phase
is necessary, carried out through the selection of techniques, algorithms and
input parameters.
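The tuning phase can be pictured as the selection of a per-domain configuration. The sketch below is purely illustrative: the domain names, parameter names and values are hypothetical, not taken from the framework itself.

```python
# Hypothetical per-domain tuning configurations: each context selects
# its own techniques, algorithms and input parameters.
DOMAIN_CONFIGS = {
    "medical": {
        "tokenizer": "clinical",      # domain-specific tokenization rules
        "stopword_list": "medical",   # which stopword list the filter uses
        "min_term_frequency": 2,      # input parameter for concept building
    },
    "security": {
        "tokenizer": "default",
        "stopword_list": "generic",
        "min_term_frequency": 3,
    },
}


def configure(domain):
    # Instantiating the framework in a context amounts to picking
    # the tuning associated with that domain.
    return DOMAIN_CONFIGS[domain]
```

Adding a new application context then reduces to registering a new entry in the configuration table rather than changing the framework's modules.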
In order to properly locate and characterize text sections, semantic text
processing techniques must be applied. Understanding a particular concept
within a specialized domain, such as the medical one, requires information
about the properties characterizing it, as well as the ability to identify the
set of entities that the concept refers to. To this aim, the preprocessing,
transformation and postprocessing modules should respectively: (i) break a
stream of text up into a list of words, phrases, or other meaningful elements
called tokens, and mark up each token with its part of speech; (ii) filter the
token list, retaining the relevant tokens in order to build concepts;
(iii) identify the text macro-structures (sections).
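Steps (i) and (ii) can be sketched as follows. The tiny POS lexicon and lemma table below are toy placeholders for the real language-dependent NLP modules the framework relies on; all names and rules here are illustrative.

```python
import re

# Toy stand-ins for real language-dependent NLP resources.
POS_LEXICON = {"patient": "NOUN", "report": "VERB", "severe": "ADJ",
               "headache": "NOUN", "the": "DET"}
LEMMAS = {"patients": "patient", "reported": "report",
          "headaches": "headache"}


def tokenize(text):
    # (i) Break the stream of text into minimal lexical units (tokens).
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(tokens):
    # Map variant surface forms of the same expression to a single form.
    return [t.lower() for t in tokens]


def lemmatize(tokens):
    # Reduce each inflected form to its citation form (lemma).
    return [LEMMAS.get(t, t) for t in tokens]


def pos_tag(tokens):
    # Mark up each lexical unit with a grammatical category
    # (defaulting to NOUN for words outside the toy lexicon).
    return [(t, POS_LEXICON.get(t, "NOUN")) for t in tokens]


def preprocess(text):
    # The preprocessing module applies the procedures in sequence.
    return pos_tag(lemmatize(normalize(tokenize(text))))


def transform(tagged, keep=("NOUN", "VERB")):
    # (ii) Filter the token list, retaining only the semantically
    # relevant tokens used to build concepts.
    return [t for t, pos in tagged if pos in keep]
```

For example, `transform(preprocess("The patients reported severe headaches"))` yields the lemmatized content words `["patient", "report", "headache"]`, discarding the determiner and the adjective under this toy filter.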
In order to process the input text and produce a list of words, the prepro-
cessing module applies, in sequence, text tokenization, text normalization,
part-of-speech tagging and lemmatization procedures. Text Tokenization
consists in segmenting sentences into minimal units of analysis, which
constitute simple or complex lexical items, including compounds, abbreviations,
acronyms and alphanumeric expressions; Text Normalization maps variations of
the same lexical expression to a single form; Part-Of-Speech (POS) Tagging
consists in assigning a grammatical category (noun, verb, adjective, adverb,
etc.) to each lexical unit identified within the text collection; Text
Lemmatization is performed in order to reduce all inflected forms to their
respective lemma, or citation form, coinciding with the singular
masculine/feminine form for nouns, the singular masculine form for adjectives
and the infinitive form for verbs. These procedures are language dependent,
consist of several sub-steps, and are implemented using state-of-the-art NLP
modules [5]. At this point, a list of tokens has been obtained from the raw
data. In order to identify concepts, not all words are equally useful: some
are semantically more relevant than others, and among these words some lexical
items weigh more than others. The transformation