Several years later, when I was at Endeca, I found myself working on terminology
extraction and had to confront the noise problems personally. Ironically, we
ended up licensing our terminology extraction algorithms to IBM as part of a
search application we built for them.
Gutierrez: What was the first data set you worked with?
Tunkelang: I feel bad that I can't remember my first, as that makes it sound
like it wasn't a deep, meaningful experience! I did spend a lot of time working
with a Reuters news corpus to test out information retrieval and information
extraction algorithms. One of the great things about my time at Endeca was
the opportunity to work with our customers' data, especially when we were
prototyping new product features.
Gutierrez: How did the fact that the Endeca data was customers' data make
you think about the data?
Tunkelang: It was nice to have a diverse set of customers and thus gain
exposure to lots of different problems. But the price of working as an enterprise
software vendor was that our relationship to the users was always indirect. So
we couldn't just decide to run experiments and observe the impact.
Gutierrez: Was there a specific aha! moment when you realized the power
of data?
Tunkelang: Not sure there's a single moment, but there were two
unforgettable moments in my relationship with data. The first was when I was
working with a digital library and realized we could dramatically improve
document tagging by algorithmically recycling author-supplied labels. While
authors tagged articles with keywords and phrases, the tagging was sparse
and inconsistent. As a result, using tags for article retrieval offered high
precision but low recall. Unfortunately, the alternative of performing full-text
search on the tags provided unacceptably low precision. So we developed a
system to bootstrap on author-supplied tags, thus
improving tagging across the collection. The result was an order of magnitude
increase in recall without sacrificing precision.
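The interview leaves the mechanics open, but one plausible reading is to treat the author-tagged articles as training data for a multi-label text classifier and let it tag the rest of the collection. A minimal sketch along those lines, assuming scikit-learn and toy data rather than the actual digital-library system:

```python
# Sketch: propagate sparse author-supplied tags across a collection by
# training one binary classifier per tag on the tagged subset.
# Everything here (data, model choice, threshold) is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the digital library: a few author-tagged articles.
tagged_docs = [
    ("ranking and relevance in web search engines", ["information retrieval"]),
    ("measuring precision and recall of a search engine", ["information retrieval"]),
    ("named entity recognition in newswire text", ["information extraction"]),
    ("extracting relations between entities from documents", ["information extraction"]),
]
untagged_docs = [
    "improving relevance ranking for enterprise search",
    "recognizing entities and relations in news articles",
]

texts, tag_sets = zip(*tagged_docs)
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(tag_sets)  # tag lists -> binary indicator matrix

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One-vs-rest lets a document receive any number of tags.
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(X, y)

# Only assign a tag when the classifier is confident; the threshold is
# what trades extra recall against the precision of the hand-assigned tags.
probabilities = classifier.predict_proba(vectorizer.transform(untagged_docs))
for doc, row in zip(untagged_docs, probabilities):
    tags = [tag for tag, p in zip(binarizer.classes_, row) if p >= 0.5]
    print(doc, "->", tags)
```

Keeping the assignment threshold high is what would let recall grow by an order of magnitude without giving back the precision of the author-supplied tags: only confident predictions are added to the index.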
The second was using entropy calculations on language models to automatically
detect events in a news archive. We started by performing entity extraction
on the archive to detect named entities and key phrases. Then, when we
performed a search, for example, “iraq”, we could compute the language model
for the search results and track it over the time span of the collection. What
we found was that sudden changes in the language model corresponded to
events. I only had the opportunity to build prototypes with this system, but
I did have the chance to demo them to people at three-letter agencies.
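The statistic itself is left unspecified; a minimal sketch of the idea, assuming smoothed unigram models per time slice and KL divergence as the change measure (both assumptions, along with the hypothetical archive below):

```python
# Sketch: detect events as abrupt shifts in the language model of a
# query's results over time. Unigram models and KL divergence are
# assumptions; the interview only mentions entropy calculations.
import math
from collections import Counter

def language_model(docs, vocab, alpha=0.01):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}

def kl_divergence(p, q):
    """How surprising distribution p is relative to q."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Hypothetical archive: search results for one query, bucketed by month.
slices = {
    "2003-01": ["inspectors continue weapons inspections",
                "inspections report expected"],
    "2003-02": ["inspections report delivered",
                "inspectors request more time"],
    "2003-03": ["invasion begins", "troops cross the border",
                "air strikes hit the capital"],
}

vocab = {w for docs in slices.values() for doc in docs for w in doc.split()}
months = sorted(slices)
previous = language_model(slices[months[0]], vocab)
for month in months[1:]:
    current = language_model(slices[month], vocab)
    # A spike in this score marks a slice whose vocabulary shifted
    # abruptly from the previous one: the signature of an event.
    print(month, round(kl_divergence(current, previous), 3))
    previous = current
```

On this toy archive, the 2003-03 slice should score highest, since its vocabulary breaks sharply with the inspection-era slices that precede it.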
 