3 Adopting the Framework in the Security Domain
Organizing unstructured text so as to obtain the same information content
in a semantically structured fashion is a challenging research field with
applications in different contexts. One possible scenario is structuring
information through a semantic approach (with techniques for knowledge
extraction and representation) in order to develop a semantic search engine
that gives access to the semantic content of an unstructured document set.
Another interesting scenario is the development of a semantic interpretation
module that enables software or hardware agents to identify situations of
interest and enhances their cognitive abilities (for example, through the
recognition and learning of vocal and/or gestural commands). Moreover, a
semantic approach can be applied to unstructured information in order to
detect sensitive information and to enforce fine-grained access control over it.
These scenarios have motivated the proposal of the framework, which can be
instantiated in different contexts. What these contexts have in common is the
need to organize and transform heterogeneous documents into a structured form.
In order to contextualize the framework in a particular domain, a tuning phase
is necessary, carried out through the selection of techniques, algorithms and
input parameters.
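The tuning phase can be pictured as the selection of a per-domain configuration. The sketch below is purely illustrative: the domain names, parameter names and values are hypothetical, not taken from the framework itself.

```python
# Hypothetical per-domain tuning configurations: each context selects
# its own techniques, algorithms and input parameters.
DOMAIN_CONFIGS = {
    "medical": {
        "tokenizer": "clinical",      # domain-specific tokenization rules
        "stopword_list": "medical",   # which stopword list the filter uses
        "min_term_frequency": 2,      # input parameter for concept building
    },
    "security": {
        "tokenizer": "default",
        "stopword_list": "generic",
        "min_term_frequency": 3,
    },
}


def configure(domain):
    # Instantiating the framework in a context amounts to picking
    # the tuning associated with that domain.
    return DOMAIN_CONFIGS[domain]
```

Adding a new application context then reduces to registering a new entry in the configuration table rather than changing the framework's modules.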
In order to properly locate and characterize text sections, semantic text
processing techniques must be applied. Understanding a particular concept
within a specialized domain, such as the medical one, requires information
about the properties characterizing it, as well as the ability to identify the
set of entities that the concept refers to. To this aim, the preprocessing,
transformation and postprocessing modules should respectively: (i) break a
stream of text up into a list of words, phrases, or other meaningful elements
called tokens, and mark up each token with its part of speech; (ii) filter the
token list, retaining the relevant tokens in order to build concepts;
(iii) identify the text macro-structures (sections).
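Steps (i) and (ii) can be sketched as follows. The tiny POS lexicon and lemma table below are toy placeholders for the real language-dependent NLP modules the framework relies on; all names and rules here are illustrative.

```python
import re

# Toy stand-ins for real language-dependent NLP resources.
POS_LEXICON = {"patient": "NOUN", "report": "VERB", "severe": "ADJ",
               "headache": "NOUN", "the": "DET"}
LEMMAS = {"patients": "patient", "reported": "report",
          "headaches": "headache"}


def tokenize(text):
    # (i) Break the stream of text into minimal lexical units (tokens).
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(tokens):
    # Map variant surface forms of the same expression to a single form.
    return [t.lower() for t in tokens]


def lemmatize(tokens):
    # Reduce each inflected form to its citation form (lemma).
    return [LEMMAS.get(t, t) for t in tokens]


def pos_tag(tokens):
    # Mark up each lexical unit with a grammatical category
    # (defaulting to NOUN for words outside the toy lexicon).
    return [(t, POS_LEXICON.get(t, "NOUN")) for t in tokens]


def preprocess(text):
    # The preprocessing module applies the procedures in sequence.
    return pos_tag(lemmatize(normalize(tokenize(text))))


def transform(tagged, keep=("NOUN", "VERB")):
    # (ii) Filter the token list, retaining only the semantically
    # relevant tokens used to build concepts.
    return [t for t, pos in tagged if pos in keep]
```

For example, `transform(preprocess("The patients reported severe headaches"))` yields the lemmatized content words `["patient", "report", "headache"]`, discarding the determiner and the adjective under this toy filter.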
In order to process the input text and produce a list of words, the prepro-
cessing module applies, in sequence, text tokenization, text normalization,
part-of-speech tagging and lemmatization procedures. Text Tokenization
consists in segmenting sentences into minimal units of analysis, which
constitute simple or complex lexical items, including compounds, abbreviations,
acronyms and alphanumeric expressions; Text Normalization maps variations of
the same lexical expression to a single form; Part-Of-Speech (POS) Tagging
consists in assigning a grammatical category (noun, verb, adjective, adverb,
etc.) to each lexical unit identified within the text collection; Text
Lemmatization is performed in order to reduce all inflected forms to their
respective lemma, or citation form, coinciding with the singular
masculine/feminine form for nouns, the singular masculine form for adjectives
and the infinitive form for verbs. These procedures are language dependent,
consist of several sub-steps, and are implemented using state-of-the-art NLP
modules [5]. At this point, a list of tokens has been obtained from the raw
data. In order to identify concepts, not all words are equally useful: some
are semantically more relevant than others, and among these words some lexical
items weigh more than others. The transformation