An example is what might be described as “universal” indexing methods. In
such methods, the object to be indexed—whether an image, movie, audio file, or
text document—is manipulated in some way, for example by a particular kind of
hash function. After this manipulation, objects of different type can be compared:
thus, somehow, documents about swimming pools and images of swimming pools
would have the same representation. Such matching is clearly an extremely difficult
problem, if not entirely insoluble; for instance, how does the method know to focus
on the swimming pool rather than some other element of the image, such as children,
sunshine, or its role as a metaphor for middle-class aspirations? 3
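To make the scheme concrete, here is a minimal sketch of what such a method might look like, assuming a hash-based manipulation; every name here is hypothetical, and the mapping step is precisely where the difficulty lies:

```python
# Hypothetical sketch of a "universal" indexing scheme. Objects of any
# media type are mapped into a single comparison space, and two objects
# "match" when their representations agree. The stand-in manipulation is
# a content hash, which exposes the gap: nothing in it knows that a text
# about pools and a photograph of a pool share a subject.
import hashlib

def to_common_space(obj: bytes) -> int:
    # Stand-in "manipulation"; a real system would need a semantic
    # mapping across media types, which is the unsolved part.
    return int.from_bytes(hashlib.sha256(obj).digest()[:8], "big")

def same_subject(a: bytes, b: bytes) -> bool:
    return to_common_space(a) == to_common_space(b)

document = b"Advice on maintaining a residential swimming pool."
image = b"\x89PNG (raw bytes of a photograph of a swimming pool)"
print(same_subject(document, image))  # False: the bytes differ, so the hashes differ
```

Any replacement for the hash that made this call return True for such inputs would have to solve exactly the focus-and-meaning problem described above.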
In some work, the evidence or methods are internally inconsistent. For example,
in a paper on how to find documents on a particular topic, the authors reported that
the method correctly identified 20,000 matches in a large document collection. But
this is a deeply improbable outcome. The figure of 20,000 hints at imprecision—it is
too round a number. More significantly, verifying that all 20,000 were matches would
require many months of effort. No mention was made of the documents that weren't
matches, implying that the method was 100% accurate; but even the best document-
matching methods have high error rates. A later paper by the same authors gave
entirely different results for the same method, while claiming similar good results
for a new method, thus throwing doubt on the whole research program. And it is a
failure of logic to suppose that the fact that two documents match according to some
arbitrary algorithm implies that the match is useful to a user.
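To see why the verification claim alone is implausible, a rough workload estimate suffices; the five minutes per document is my assumption:

\[
20\,000 \times 5\ \text{min} \approx 1\,667\ \text{h} \approx 208\ \text{eight-hour days},
\]

most of a working year, spent before a single non-match has been examined.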
The logic underlying some papers is outright mystifying. To an author, it may
seem a major step to identify and solve a new problem, but such steps can go too
far. A paper on retrieval for a specific form of graph used a new query language
and matching technique, a new way of evaluating similarity, and data based on a
new technique for deriving the graphs from text and semantically (that word again!)
labelling the edges. Every element of this paper was a separate contribution whose
merit could be disputed. Presented in a brief paper, the work seemed worthless.
Inventing a problem, a solution to the problem, and a measure of the solution—all
without external justification—is a widespread form of bad science. 4
(Footnote 2 continued)
the 7 kilobytes that such a modem could transmit per second. Uncompressed, the bandwidth of a
modem was only sufficient for one byte per row per image, or, per image, about the space needed to
transmit a desktop icon. A further skeptical consideration in this case was that an audio signal was
also transmitted. Had the system been legitimate, the inventor would have had to develop new
solutions to the independent problems of image compression, motion encoding, and audio compression.
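The arithmetic behind this skepticism can be reconstructed; a 56 kbit/s modem is consistent with the stated 7 kilobytes per second, while the figure of 25 images per second and an image height of roughly 280 rows are assumptions for illustration, since the opening of the footnote precedes this excerpt:

\[
\frac{56\,000\ \text{bit/s}}{8\ \text{bit/byte}} = 7\,000\ \text{bytes/s},
\qquad
\frac{7\,000\ \text{bytes/s}}{25\ \text{images/s}} = 280\ \text{bytes per image},
\]

about one byte per row for an image of some 280 rows, and comparable to a 16 × 16 icon at one byte per pixel (256 bytes).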
3 In another variant of this theme, objects of the same type were clustered together using some kind
of similarity metric. Then the patterns of clustering were analyzed, and objects that clustered in
similar ways were supposed to have similar subject matter. Although the use of clustering disguises
it, such an approach can succeed only if it rests on an underlying universal matching method.
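A minimal sketch of this variant, using invented data and a naive single-pass clustering, shows where the assumption hides: the within-type step is easy to write down, but the cross-type alignment has no implementation without a universal matching method.

```python
# Sketch of footnote 3's clustering variant (all names and data are
# hypothetical). Clustering within one media type needs only an
# ordinary within-type similarity; relating an image cluster to a text
# cluster is where a universal matching method would be required.
from typing import Callable, List

def cluster(objs: List[str], similar: Callable[[str, str], float],
            threshold: float) -> List[List[str]]:
    """Naive single-pass clustering: join the first cluster whose
    representative is similar enough, else start a new cluster."""
    clusters: List[List[str]] = []
    for obj in objs:
        for c in clusters:
            if similar(obj, c[0]) >= threshold:
                c.append(obj)
                break
        else:
            clusters.append([obj])
    return clusters

def text_similarity(a: str, b: str) -> float:
    # Within-type metric for texts: Jaccard overlap of word sets.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def align(image_clusters: List[List[bytes]],
          text_clusters: List[List[str]]) -> None:
    # The step the approach quietly assumes: deciding that an image
    # cluster and a text cluster share subject matter is cross-type
    # comparison, i.e. universal matching.
    raise NotImplementedError("requires a universal matching method")

texts = ["pool safety rules", "children at the pool", "tax law reform"]
print(cluster(texts, text_similarity, threshold=0.15))
# [['pool safety rules', 'children at the pool'], ['tax law reform']]
```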
4 An interesting question is how to regard “Zipf's law”. This observation—“law” seems a poor
choice of terminology in this context—is if nothing else a curious case study. Zipf's books may
be widely cited but they are not, I suspect, widely read. In Human Behaviour and the Principle of
Least Effort (Addison-Wesley, 1949), Zipf used languages and word frequencies as one of several
examples to illustrate his observation, but his motivation for the work is not quite what might
be expected. He states, for example, that his research "define[s] objectively what we mean by the term
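For reference, since the observation itself is not stated above: applied to word frequencies it is usually written as

\[
f(r) \propto \frac{1}{r^{s}},
\]

where \(f(r)\) is the frequency of the \(r\)-th most frequent word and \(s \approx 1\) for English text, so the second-ranked word occurs roughly half as often as the first.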