descriptions) to one or more categories. The main tool for implementing mod-
ern systems for automatic document classification is machine learning based on
vector space document representations.
In order to be able to apply standard machine learning methods for building
categorizers, we need to represent the objects we want to classify by extracting
informative features . Such features are used as indications that an object belongs
to a certain category. For categorisation of documents, the standard representa-
tion of features maps every document into a vector space using the bag-of-words
approach [24]. In this method, every word in the vocabulary is associated with
a dimension of the vector space, allowing the document to be mapped into the
vector space simply by computing the occurrence frequencies of each word. For
example, a document consisting of the string “get Weather, get Station” could
be represented as the vector (2, 1, 1, ...) where, e.g., the 2 in the first dimension is
the frequency of the “get” token. The bag-of-words representation is considered
the standard representation underlying most document classification approaches.
In contrast, attempts to incorporate more complex structural information have
mostly been unsuccessful for the task of categorisation of single documents [21]
although they have been successful for complex relational classification tasks [19].
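The bag-of-words mapping described above can be sketched in a few lines of Python; the vocabulary and the tokenization rule (lowercasing and splitting on whitespace and commas) are simplifying assumptions, not part of the original method description:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to a term-frequency vector over a fixed vocabulary.

    Tokenization here is a deliberately naive sketch: lowercase the text,
    treat commas as whitespace, and split on whitespace.
    """
    tokens = text.lower().replace(",", " ").split()
    counts = Counter(tokens)
    # One dimension per vocabulary word, valued by occurrence frequency.
    return [counts[word] for word in vocabulary]

vocab = ["get", "weather", "station"]
vec = bag_of_words("get Weather, get Station", vocab)
# vec == [2, 1, 1], matching the example vector (2, 1, 1, ...) above
```

Real systems would of course use a much larger vocabulary and a more careful tokenizer, but the vector construction itself is exactly this frequency count.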
However, the task of classifying interface descriptions is different from classify-
ing raw textual documents. Indeed, the interface descriptions are semi-structured
rather than unstructured, and the representation method clearly needs to take
this fact into account, for instance, by separating the vector space representation
into regions for the respective parts of the interface description. In addition to
the text, various semi-structured identifiers should be included in the feature
representation, e.g., the names of the method and input parameters defined by
the interface. The inclusion of identifiers is important since: (i) the textual
content of the identifiers is often highly informative of the functionality provided
by the respective methods; and (ii) the free text documentation is not mandatory
and may not always be present.
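Separating the vector space into regions, as suggested above, can be realized by concatenating one bag-of-words block per part of the interface description. The region names (documentation, methods, parameters) and the shared vocabulary below are illustrative assumptions, chosen only to make the idea concrete:

```python
from collections import Counter

def region_vector(tokens, vocabulary):
    """Term-frequency vector for one region of the interface description."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def interface_features(regions, vocabulary):
    """Concatenate one bag-of-words block per region, so that the same
    word occurring in the free-text documentation, in a method name, or
    in a parameter name occupies distinct dimensions.

    `regions` maps region names (assumed here) to token lists; a missing
    region simply contributes an all-zero block.
    """
    vector = []
    for region in ("documentation", "methods", "parameters"):
        vector += region_vector(regions.get(region, []), vocabulary)
    return vector

vocab = ["get", "weather", "zip", "code"]
description = {
    "methods": ["get", "weather", "by", "zip", "code"],
    "parameters": ["zip", "code"],
}
features = interface_features(description, vocab)
# 3 regions x 4 vocabulary words = 12 dimensions; the empty
# documentation region contributes the leading zero block.
```

This keeps the learner able to weight, say, a word in a method name differently from the same word in the documentation, which is the point of the region separation.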
For example, if the functionality of the interface is described by an XML
file written in WSDL, we would have tags and structures, as illustrated by the
text fragment below, which relates to a NS implementing a weather station and
is part of the GMES scenario detailed in the next section on experiments:
<wsdl:message name="GetWeatherByZipCodeSoapIn">
  <wsdl:part name="parameters"
             element="tns:GetWeatherByZipCode"/>
</wsdl:message>
<wsdl:message name="GetWeatherByZipCodeSoapOut">
  <wsdl:part name="parameters"
             element="tns:GetWeatherByZipCodeResponse"/>
</wsdl:message>
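Identifiers such as the message names can be pulled out of such a fragment with a standard XML parser. The sketch below assumes the WSDL 1.1 namespace URI and wraps the fragment in a hypothetical root element with an invented `tns` namespace so that it parses in isolation:

```python
import xml.etree.ElementTree as ET

# Cleaned-up version of the fragment above, wrapped in a root element
# so it is well-formed on its own. The tns URI is a made-up placeholder;
# the wsdl URI is the standard WSDL 1.1 namespace.
WSDL_FRAGMENT = """\
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                  xmlns:tns="http://example.org/weather">
  <wsdl:message name="GetWeatherByZipCodeSoapIn">
    <wsdl:part name="parameters" element="tns:GetWeatherByZipCode"/>
  </wsdl:message>
  <wsdl:message name="GetWeatherByZipCodeSoapOut">
    <wsdl:part name="parameters" element="tns:GetWeatherByZipCodeResponse"/>
  </wsdl:message>
</wsdl:definitions>"""

root = ET.fromstring(WSDL_FRAGMENT)
ns = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}
# Collect the name attribute of every wsdl:message element.
names = [m.get("name") for m in root.findall("wsdl:message", ns)]
# names == ["GetWeatherByZipCodeSoapIn", "GetWeatherByZipCodeSoapOut"]
```

These extracted names are exactly the semi-structured identifiers that feed the feature representation discussed above.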
It is clear that splitting the CamelCase identifier GetWeatherStation into
the tokens get, weather, and station would provide more meaningful and
generalised concepts, which the learning algorithm can use as features. Indeed,
to extract useful word tokens from the identifiers, we split them into pieces based