descriptions) to one or more categories. The main tool for implementing mod-
ern systems for automatic document classification is machine learning based on
vector space document representations.
In order to be able to apply standard machine learning methods for building
categorizers, we need to represent the objects we want to classify by extracting
informative features . Such features are used as indications that an object belongs
to a certain category. For categorisation of documents, the standard representa-
tion of features maps every document into a vector space using the bag-of-words
approach [24]. In this method, every word in the vocabulary is associated with
a dimension of the vector space, allowing the document to be mapped into the
vector space simply by computing the occurrence frequencies of each word. For
example, a document consisting of the string “get Weather, get Station” could
be represented as the vector (2, 1, 1, ...) where, e.g., the 2 in the first dimension is
the frequency of the “get” token. The bag-of-words representation is considered
the standard representation underlying most document classification approaches.
In contrast, attempts to incorporate more complex structural information have
mostly been unsuccessful for the task of categorisation of single documents [21]
although they have been successful for complex relational classification tasks [19].
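The bag-of-words mapping described above can be sketched in a few lines of Python; the vocabulary and the tokenization rule (lowercasing and splitting on whitespace and commas) are simplifying assumptions, not part of the original method description:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to a term-frequency vector over a fixed vocabulary.

    Tokenization here is a deliberately naive sketch: lowercase the text,
    treat commas as whitespace, and split on whitespace.
    """
    tokens = text.lower().replace(",", " ").split()
    counts = Counter(tokens)
    # One dimension per vocabulary word, valued by occurrence frequency.
    return [counts[word] for word in vocabulary]

vocab = ["get", "weather", "station"]
vec = bag_of_words("get Weather, get Station", vocab)
# vec == [2, 1, 1], matching the example vector (2, 1, 1, ...) above
```

Real systems would of course use a much larger vocabulary and a more careful tokenizer, but the vector construction itself is exactly this frequency count.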
However, the task of classifying interface descriptions is different from classify-
ing raw textual documents. Indeed, the interface descriptions are semi-structured
rather than unstructured, and the representation method clearly needs to take
this fact into account, for instance, by separating the vector space representation
into regions for the respective parts of the interface description. In addition to
the text, various semi-structured identifiers should be included in the feature
representation, e.g., the names of the method and input parameters defined by
the interface. The inclusion of identifiers is important since: (i) the textual
content of the identifiers is often highly informative of the functionality provided
by the respective methods; and (ii) the free text documentation is not mandatory
and may not always be present.
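Separating the vector space into regions, as suggested above, can be realized by concatenating one bag-of-words block per part of the interface description. The region names (documentation, methods, parameters) and the shared vocabulary below are illustrative assumptions, chosen only to make the idea concrete:

```python
from collections import Counter

def region_vector(tokens, vocabulary):
    """Term-frequency vector for one region of the interface description."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def interface_features(regions, vocabulary):
    """Concatenate one bag-of-words block per region, so that the same
    word occurring in the free-text documentation, in a method name, or
    in a parameter name occupies distinct dimensions.

    `regions` maps region names (assumed here) to token lists; a missing
    region simply contributes an all-zero block.
    """
    vector = []
    for region in ("documentation", "methods", "parameters"):
        vector += region_vector(regions.get(region, []), vocabulary)
    return vector

vocab = ["get", "weather", "zip", "code"]
description = {
    "methods": ["get", "weather", "by", "zip", "code"],
    "parameters": ["zip", "code"],
}
features = interface_features(description, vocab)
# 3 regions x 4 vocabulary words = 12 dimensions; the empty
# documentation region contributes the leading zero block.
```

This keeps the learner able to weight, say, a word in a method name differently from the same word in the documentation, which is the point of the region separation.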
For example, if the functionality of the interface is described by an XML
file written in WSDL, we would have tags and structures, as illustrated by the
text fragment below, which relates to a NS implementing a weather station and
is part of the GMES scenario detailed in the next section on experiments:
<wsdl:message name="GetWeatherByZipCodeSoapIn">
  <wsdl:part name="parameters"
             element="tns:GetWeatherByZipCode"/>
</wsdl:message>
<wsdl:message name="GetWeatherByZipCodeSoapOut">
  <wsdl:part name="parameters"
             element="tns:GetWeatherByZipCodeResponse"/>
</wsdl:message>
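Identifiers such as the message names can be pulled out of such a fragment with a standard XML parser. The sketch below assumes the WSDL 1.1 namespace URI and wraps the fragment in a hypothetical root element with an invented `tns` namespace so that it parses in isolation:

```python
import xml.etree.ElementTree as ET

# Cleaned-up version of the fragment above, wrapped in a root element
# so it is well-formed on its own. The tns URI is a made-up placeholder;
# the wsdl URI is the standard WSDL 1.1 namespace.
WSDL_FRAGMENT = """\
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                  xmlns:tns="http://example.org/weather">
  <wsdl:message name="GetWeatherByZipCodeSoapIn">
    <wsdl:part name="parameters" element="tns:GetWeatherByZipCode"/>
  </wsdl:message>
  <wsdl:message name="GetWeatherByZipCodeSoapOut">
    <wsdl:part name="parameters" element="tns:GetWeatherByZipCodeResponse"/>
  </wsdl:message>
</wsdl:definitions>"""

root = ET.fromstring(WSDL_FRAGMENT)
ns = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}
# Collect the name attribute of every wsdl:message element.
names = [m.get("name") for m in root.findall("wsdl:message", ns)]
# names == ["GetWeatherByZipCodeSoapIn", "GetWeatherByZipCodeSoapOut"]
```

These extracted names are exactly the semi-structured identifiers that feed the feature representation discussed above.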
It is clear that splitting the CamelCase identifier GetWeatherStation into
the tokens get, weather, and station would provide more meaningful and
generalised concepts, which the learning algorithm can use as features. Indeed,
to extract useful word tokens from the identifiers, we split them into pieces based