2.1 Types of Data Input
With regard to the text sequences considered in this paper, Greer (2011) describes
how a time element can be used to define sequences of events that might contain
groups of concepts. A time stamp can record when the concept is presented to the
concept base, with groups presented at the same time being considered to be related
to each other. This is therefore built largely on the 'use' of the system, where these
concept sequences could be recognised and learnt by something resembling a neural
network, for example. The uncertainty of the real world means that concept
sequences are unlikely to always be the same, and so key to success is the
ability to generalise over the data and also to accommodate a certain level of
randomness or noise. The intention is that the neural network will be able to do this
relatively well. It is also true that there is a lot of existing structure already available
in information sources, but it might not be clear what the best form of that is. Online
datasets, for example, can be continuous streams of information, defined by time
stamps. While the data will contain structure, there is no clearly defined start or end,
but more of a continuous and cyclic list of information, from which clear patterns
need to be recognised.
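As a minimal sketch of this time-stamped grouping, the following assumes a hypothetical stream of (time stamp, concept) pairs; concepts presented with the same stamp are collected into a related group:

```python
from collections import defaultdict

# Hypothetical event stream: (time stamp, concept) pairs. Concepts that
# arrive with the same time stamp are treated as related to each other.
events = [
    (1, "coffee"), (1, "milk"),
    (2, "bread"), (2, "butter"),
    (3, "coffee"), (3, "sugar"),
]

# Group the stream by time stamp to recover the related concept groups.
groups = defaultdict(list)
for stamp, concept in events:
    groups[stamp].append(concept)

related = [groups[t] for t in sorted(groups)]
print(related)  # [['coffee', 'milk'], ['bread', 'butter'], ['coffee', 'sugar']]
```

The same grouping step would apply to a continuous stream, with each newly arriving stamp opening a new candidate group.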
As well as events, text might be presented in the form of static documents or
papers that need to be classified. For the proposed system, there are some simple
answers to the problem of how to recognise the existing structure. The author has
also been working on a text-based processing application. One feature of the text
processor is the ability to generate sorted lists of words from whole text documents.
Word lists can also appear as cyclic lists and patterns can again be recognised. This
current section of text, for example, is a list of words with nested patterns. In that
case, structure could be recognised as a sequence, ending when the word that started
the sequence is encountered again. To sort the text, each term in the sequence could
be assigned a count of the number of times it has occurred, as part of the sequence.
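This reading of a cyclic word list can be sketched as follows; the word list and the `extract_sequence` helper are hypothetical illustrations, not taken from the described system:

```python
from collections import Counter

def extract_sequence(words, start_index=0):
    """Read a sequence from a word list, ending when the word that
    started the sequence is encountered again (a cyclic pattern)."""
    first = words[start_index]
    seq = [first]
    for word in words[start_index + 1:]:
        if word == first:  # cycle closed: the starting word recurs
            break
        seq.append(word)
    # Each term carries a count of how often it occurred in the sequence.
    return seq, Counter(seq)

words = ["tree", "concept", "base", "concept", "tree", "node"]
seq, counts = extract_sequence(words)
print(seq)  # ['tree', 'concept', 'base', 'concept']
print(counts["concept"], counts["base"])
```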
How many times does 'tree' follow 'concept', for example, but a sequence can be
more than one word deep. Sequences that contain the same words, or overlap, can
be combined, to create the concept trees in the concept base. To select starting or
base words, for example, a bag-of-words with frequency counts can determine the
most popular ones. The decision might then be to read or process text sequences
only if they start with these key words. Pre-formatting or filtering of the text can
also be performed. Because this information would be created from existing text
documents, the process would be more semantic and knowledge-based. This does
not exclude the addition of a time element, however, and a global system would
benefit from using all of these attributes.
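A minimal sketch of this selection step, assuming a hypothetical text fragment and some hand-made candidate sequences: a bag-of-words with frequency counts picks the most popular terms as base words, and only sequences starting with one of them are kept.

```python
from collections import Counter

text = (
    "a concept base links concept trees where the base stores each "
    "concept branch and the base grows with every concept"
)
words = text.split()

# Bag-of-words with frequency counts: the most popular terms become
# the starting (base) words for concept trees.
bag = Counter(words)
base_words = {w for w, _ in bag.most_common(3)}

# Filtering: only read a text sequence if it starts with a key word.
sequences = [
    ["concept", "tree", "branch"],
    ["random", "noise", "words"],
    ["base", "word", "list"],
]
kept = [s for s in sequences if s[0] in base_words]
print(kept)  # [['concept', 'tree', 'branch'], ['base', 'word', 'list']]
```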
The concept trees can then evolve, adding and updating branches as new
information is received. Processing just a few test documents, however, shows that
different word sorts of the original data will produce different sequences, from
which these basic structures are built, so the decision of correct structure is still
quite arbitrary. On the technical front, it might be more correct to always
use complete lists of concepts, as they are presented or received, and then try to
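The evolving structure described above can be sketched as a trie-like nested dictionary, where sequences sharing a prefix merge into the same branch and counts are updated as new information arrives; the `add_sequence` helper and the example sequences are hypothetical:

```python
# Minimal sketch of an evolving concept tree: overlapping sequences
# merge into shared branches, each node keeping an occurrence count.
def add_sequence(tree, sequence):
    node = tree
    for word in sequence:
        child = node.setdefault(word, {"count": 0, "children": {}})
        child["count"] += 1  # update the branch on each visit
        node = child["children"]

tree = {}
add_sequence(tree, ["concept", "tree", "branch"])
add_sequence(tree, ["concept", "tree", "node"])  # overlap: merged branch
add_sequence(tree, ["concept", "base"])

print(tree["concept"]["count"])             # 3
print(sorted(tree["concept"]["children"]))  # ['base', 'tree']
```

Because the three sequences share prefixes, they collapse into a single tree rooted at 'concept', rather than three separate structures.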