Text Search-Enhanced with Types and Entities - Text Mining: Classification, Clustering, and Applications


Chapter 10

Text Search-Enhanced with Types and Entities

Soumen Chakrabarti, Sujatha Das, Vijay Krishnan, and Kriti Puniyani

10.1 Entity-Aware Search Architecture
10.2 Understanding the Question
10.3 Scoring Potential Answer Snippets
10.4 Indexing and Query Processing
10.5 Conclusion

10.1 Entity-Aware Search Architecture

Until recently, large-scale text and Web search systems regarded a document as a sequence of string tokens. Queries likewise consisted of string tokens, and the search engine's job was to assign a score to each document based on the extent of matches between query and document tokens, the rarity of the query tokens in the corpus, and, more recently, the “prestige” of the Web document in the social network of hyperlinks.
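The three scoring signals named above (token matches, rarity of query tokens, and prestige) can be illustrated with a minimal sketch. It is not the scoring function of any particular search engine: the function name `toy_score`, the smoothed inverse-document-frequency formula, the multiplicative use of prestige, and the three-document corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def toy_score(query_tokens, doc_tokens, doc_freq, num_docs, prestige=1.0):
    """Hypothetical token-based score: matched term counts weighted by rarity,
    then scaled by a precomputed prestige value (assumed to come from link
    analysis done elsewhere)."""
    term_counts = Counter(doc_tokens)
    total = 0.0
    for tok in set(query_tokens):
        count = term_counts[tok]
        if count == 0:
            continue  # query token does not occur in this document
        # Smoothed rarity weight: tokens found in every document get weight 0.
        idf = math.log((num_docs + 1) / (doc_freq.get(tok, 0) + 1))
        total += count * idf
    return prestige * total


# Illustrative corpus; each document is already split into string tokens.
corpus = {
    "d1": "the quick brown fox".split(),
    "d2": "the lazy dog".split(),
    "d3": "the quick dog barks".split(),
}

# Number of documents in which each token appears at least once.
doc_freq = Counter(tok for toks in corpus.values() for tok in set(toks))

query = ["quick", "fox"]
ranked = sorted(
    corpus,
    key=lambda d: toy_score(query, corpus[d], doc_freq, len(corpus)),
    reverse=True,
)
```

In this sketch a document is nothing more than a bag of string tokens, which is exactly the view the rest of the chapter moves away from: nothing here knows that a token might name a person, a place, or a quantity.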

Several parallel and interrelated developments have changed this state of affairs in the last few years. Some smaller-scale search applications were already more heavily invested in computational linguistics and natural language processing (NLP), and those technologies are being imported into and scaled up to benefit large-scale search. Machine learning techniques for tagging entities mentioned in unstructured text have become quite sophisticated, scalable, and robust. XML is often used to represent typed entity-relationship graphs, and query engines for XML already support graph idioms that are common in entity extraction and NLP.

Gradually, Web search engines have begun to interpret string tokens against the backdrop of our physical world. A five-digit number is interpreted as a ZIP code in some contexts. Many named entities are recognized and exploited:

• Recognizing that a query is a person name triggers a “diversity”
