Advanced Text Processing with Spark - Machine Learning with Spark

What's so special about text data?

Text data can be complex to work with for two main reasons. First, text and language have an inherent structure that is not easily captured by the raw words alone (for example, meaning, context, different types of words, sentence structure, and different languages, to highlight a few). Therefore, naïve feature extraction is usually relatively ineffective. Second, the effective dimensionality of text data is extremely large and potentially limitless. Think about the number of words in the English language alone, then add all kinds of special words, characters, slang, and so on. Then, throw in other languages and all the types of text one might find across the Internet. The dimension of text data can easily exceed tens or even hundreds of millions of words, even in relatively small datasets. For example, the Common Crawl dataset of billions of websites contains over 840 billion individual words.
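To make the dimensionality point concrete, here is a minimal sketch in plain Python (no Spark required) of a naive bag-of-words encoding, in which every distinct token becomes one dimension. The toy corpus, the whitespace tokenizer, and the helper names `tokenize` and `to_vector` are assumptions made for illustration only; they are not taken from the text.

```python
from collections import Counter

# Hypothetical toy corpus; real corpora are many orders of magnitude larger.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "LOL that dog is sooo fast",
]

def tokenize(text):
    # Naive tokenization: lowercase and split on whitespace (an assumption for this sketch).
    return text.lower().split()

# Each distinct token in the corpus becomes one dimension of the feature space.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
index = {tok: i for i, tok in enumerate(vocab)}

def to_vector(text):
    # Count each token and place the count at that token's position in the vocabulary.
    counts = Counter(tokenize(text))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in index:
            vec[index[tok]] = n
    return vec

vectors = [to_vector(d) for d in docs]
print(len(vocab), "dimensions for", len(docs), "documents")
```

Even in this tiny corpus, slang such as "sooo" adds a dimension of its own, and every vector is mostly zeros. With this kind of encoding, the vector length grows with every new distinct token the corpus contains.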

To deal with these issues, we need ways of extracting more structured features and methods to handle the huge dimensionality of text data.
