Several years later, when I was at Endeca, I found myself working on terminology
extraction and had to confront the noise problems personally. Ironically, we
ended up licensing our terminology extraction algorithms to IBM as part of a
search application we built for them.
Gutierrez: What was the first data set you worked with?
Tunkelang: I feel bad that I can't remember my first, as that makes it sound
like it wasn't a deep, meaningful experience! I did spend a lot of time working
with a Reuters news corpus to test out information retrieval and information
extraction algorithms. One of the great things about my time at Endeca was
the opportunity to work with our customers' data, especially when we were
prototyping new product features.
Gutierrez: How did the fact that the Endeca data was customers' data make
you think about the data?
Tunkelang: It was nice to have a diverse set of customers and thus gain
exposure to lots of different problems. But the price of working as an enterprise
software vendor was that our relationship to the users was always indirect. So
we couldn't just decide to run experiments and observe the impact.
Gutierrez: Was there a specific aha! moment when you realized the power
of data?
Tunkelang: Not sure there's a single moment, but there were two
unforgettable moments in my relationship with data. The first was when I was
working with a digital library and realized we could dramatically improve
document tagging by algorithmically recycling author-supplied labels. While
authors tagged articles with keywords and phrases, the tagging was sparse
and inconsistent. As a result, using tags for article retrieval offered high
precision but low recall. Unfortunately, the alternative of performing full-text
search on the tags provided unacceptably low precision. So we developed a
system to bootstrap on author-supplied tags, thus
improving tagging across the collection. The result was an order of magnitude
increase in recall without sacrificing precision.
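The interview leaves the mechanics open, but one plausible reading is to treat the author-tagged articles as training data for a multi-label text classifier and let it tag the rest of the collection. A minimal sketch along those lines, assuming scikit-learn and toy data rather than the actual digital-library system:

```python
# Sketch: propagate sparse author-supplied tags across a collection by
# training one binary classifier per tag on the tagged subset.
# Everything here (data, model choice, threshold) is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the digital library: a few author-tagged articles.
tagged_docs = [
    ("ranking and relevance in web search engines", ["information retrieval"]),
    ("measuring precision and recall of a search engine", ["information retrieval"]),
    ("named entity recognition in newswire text", ["information extraction"]),
    ("extracting relations between entities from documents", ["information extraction"]),
]
untagged_docs = [
    "improving relevance ranking for enterprise search",
    "recognizing entities and relations in news articles",
]

texts, tag_sets = zip(*tagged_docs)
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(tag_sets)  # tag lists -> binary indicator matrix

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One-vs-rest lets a document receive any number of tags.
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(X, y)

# Only assign a tag when the classifier is confident; the threshold is
# what trades extra recall against the precision of the hand-assigned tags.
probabilities = classifier.predict_proba(vectorizer.transform(untagged_docs))
for doc, row in zip(untagged_docs, probabilities):
    tags = [tag for tag, p in zip(binarizer.classes_, row) if p >= 0.5]
    print(doc, "->", tags)
```

Keeping the assignment threshold high is what would let recall grow by an order of magnitude without giving back the precision of the author-supplied tags: only confident predictions are added to the index.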
The second was using entropy calculations on language models to automatically
detect events in a news archive. We started by performing entity extraction
on the archive to detect named entities and key phrases. Then, when we
performed a search, for example, “iraq”, we could compute the language model
for the search results and track it over the time span of the collection. What
we found was that sudden changes in the language model corresponded to
events. I only had the opportunity to build prototypes with this system, but
I did have the chance to demo them to people at three-letter agencies.
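The statistic itself is left unspecified; a minimal sketch of the idea, assuming smoothed unigram models per time slice and KL divergence as the change measure (both assumptions, along with the hypothetical archive below):

```python
# Sketch: detect events as abrupt shifts in the language model of a
# query's results over time. Unigram models and KL divergence are
# assumptions; the interview only mentions entropy calculations.
import math
from collections import Counter

def language_model(docs, vocab, alpha=0.01):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}

def kl_divergence(p, q):
    """How surprising distribution p is relative to q."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Hypothetical archive: search results for one query, bucketed by month.
slices = {
    "2003-01": ["inspectors continue weapons inspections",
                "inspections report expected"],
    "2003-02": ["inspections report delivered",
                "inspectors request more time"],
    "2003-03": ["invasion begins", "troops cross the border",
                "air strikes hit the capital"],
}

vocab = {w for docs in slices.values() for doc in docs for w in doc.split()}
months = sorted(slices)
previous = language_model(slices[months[0]], vocab)
for month in months[1:]:
    current = language_model(slices[month], vocab)
    # A spike in this score marks a slice whose vocabulary shifted
    # abruptly from the previous one: the signature of an event.
    print(month, round(kl_divergence(current, previous), 3))
    previous = current
```

On this toy archive, the 2003-03 slice should score highest, since its vocabulary breaks sharply with the inspection-era slices that precede it.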
 