structure of language empirically, which is to be done computationally by the
statistical analysis of actual samples of human language. In other words, it called
for the building of “language processing programs which had a sound philosophical basis”
(Wilks 2005).
One of the six students in the Wittgenstein course that became The Blue Book,
Masterman was directly exposed by Wittgenstein to the conceptual apparatus of
the Philosophical Investigations (Sowa 2006). Twenty years later, she founded
the Cambridge Language Research Unit, where the foundations of information
retrieval were laid by Karen Sparck Jones, a student of Masterman and of
Masterman's husband Richard Braithwaite (Wilks 2007). In her dissertation Synonymy
and Semantic Classification, Sparck Jones proposed “a
characterisation of, and a basis for deriving, semantic primitives, i.e. the general
concepts under which natural language words and messages are categorized”
(Sparck Jones 1986). She did this by applying the statistical 'Theory of Clumps'
of Roger Needham - a theory that was itself one of the first to explicate what
Wittgenstein called “family resemblances” - to words themselves, leading her to
posit that words could be defined in terms of statistical clumps of other words
(Needham 1962). Her technique prefigures much of the later work of the 'statistical
turn' in natural language research, as well as our own term-based statistical notion
of sense developed in the previous two chapters. As she applied her work to ever
larger sources of natural language data, she later abandoned even the open-ended
semantic primitives of Masterman. In her later critique of artificial intelligence, she
argued that one of the key insights of information retrieval is that programs should
take “words as they stand” rather than treating them merely as adjuncts to some logical knowledge
representation system (Sparck Jones 1999). The connection to search engines is
clear: AltaVista, the first modern Web search engine, was created after its inventor,
Mike Burrows, e-mailed Sparck Jones and Needham about techniques in information
retrieval.
Search engines work by analysing existing web-pages, breaking them down
into terms and then mapping those terms and their frequencies in a given web-page
into a large index. Each URI can thus be thought of as a collection of terms
in this search-engine index. As the collection of term frequencies gathered into
the index grows, ranging over larger and larger sources of data such as the Web, it
approximates human language use, as studies in computational linguistics have
shown (Keller and Lapata 2003). Users of a search engine then enter certain terms,
and the search query is matched against the index by the engine's retrieval
algorithms; a minimal sketch of such an index follows below.
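As a concrete illustration of this indexing step, here is a minimal sketch in Python, assuming the web-pages have already been fetched and reduced to plain text; the tokeniser, the toy 'web-pages', and the example URIs are hypothetical simplifications rather than the internals of any particular search engine.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Naive tokeniser: lower-case the text and split on whitespace."""
    return text.lower().split()

def build_index(pages):
    """Build an inverted index mapping each term to the URIs it occurs in,
    together with its frequency in each page.

    `pages` maps a URI to that page's plain text.
    """
    index = defaultdict(dict)  # term -> {uri: frequency}
    for uri, text in pages.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][uri] = freq
    return index

# Hypothetical toy collection of 'web-pages'.
pages = {
    "http://example.org/a": "the cat sat on the mat",
    "http://example.org/b": "the dog chased the cat",
}
index = build_index(pages)
# A query is matched by looking its terms up in the index:
print(index["cat"])  # {'http://example.org/a': 1, 'http://example.org/b': 1}
```

Looking up each query term in such an index directly yields the set of URIs in which the term occurs, which is the starting point for the ranking step described next.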
This matching results in an unordered list of possibly relevant URIs, which for an
index covering the entire Web can range from thousands to millions of URIs. These
URIs are then ranked and ordered using an algorithm such as Google's famous
PageRank, possibly taking user feedback into account (Brin and Page 1998); a
minimal sketch of the PageRank computation follows below.
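The core of PageRank as described by Brin and Page (1998) can be sketched as a simple power iteration over the link graph; the link graph, damping factor, and iteration count below are illustrative assumptions, not the production ranking system, which combines many further signals.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over a link graph.

    `links` maps each URI to the list of URIs it links to; every URI that
    appears as a link target must also appear as a key.
    """
    uris = list(links)
    n = len(uris)
    rank = {uri: 1.0 / n for uri in uris}
    for _ in range(iterations):
        new_rank = {uri: (1.0 - damping) / n for uri in uris}
        for uri, outgoing in links.items():
            targets = outgoing if outgoing else uris  # dangling pages spread rank evenly
            share = damping * rank[uri] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical three-page link graph.
links = {
    "http://example.org/a": ["http://example.org/b"],
    "http://example.org/b": ["http://example.org/a", "http://example.org/c"],
    "http://example.org/c": ["http://example.org/a"],
}
print(pagerank(links))
```

Pages that receive many links from highly ranked pages end up with higher scores, which is then combined with the term-matching step above to order the result list.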
User-based relevance feedback works as follows: search engines keep track of which
URIs users actually click on. This stream of clicks from multiple users can be stored
in a query log, and the query log can then be used to improve the discovery and
ranking of URIs. By inspecting which terms lead to which URIs across multiple users,
a set of terms that best describes a URI for its users can be discovered, as sketched
below.
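To make this aggregation concrete, the following toy sketch assumes the query log has already been reduced to (query, clicked URI) pairs, a hypothetical format chosen only for illustration.

```python
from collections import Counter, defaultdict

def terms_describing(query_log, top_n=3):
    """For each clicked URI, find the query terms that most often lead to it.

    `query_log` is a list of (query string, clicked URI) pairs.
    """
    counts = defaultdict(Counter)  # uri -> Counter of query terms
    for query, uri in query_log:
        for term in query.lower().split():
            counts[uri][term] += 1
    return {uri: [term for term, _ in c.most_common(top_n)]
            for uri, c in counts.items()}

# Hypothetical clicks from multiple users.
query_log = [
    ("cambridge language research unit", "http://example.org/clru"),
    ("masterman language research", "http://example.org/clru"),
    ("pagerank paper", "http://example.org/brin-page"),
]
print(terms_describing(query_log))
# e.g. {'http://example.org/clru': ['language', 'research', 'cambridge'], ...}
```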
In this way, typing terms into a search engine can be thought of as