What's so special about text data?
Text data can be complex to work with for two main reasons. First, text and language have
an inherent structure that is not easily captured using the raw words as is (for example,
meaning, context, different types of words, sentence structure, and different languages, to
highlight a few). Therefore, naïve feature extraction is usually relatively ineffective.
Second, the effective dimensionality of text data is extremely large and potentially
limitless. Think about the number of words in the English language alone, and then add all
kinds of special words, characters, slang, and so on. Then, throw in other languages and
all the types of text one might find across the Internet. The dimension of text data can
easily exceed tens or even hundreds of millions of distinct words, even in relatively small
datasets. For example, the Common Crawl dataset of billions of websites contains over
840 billion individual words.
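To make the dimensionality point concrete, here is a minimal sketch in plain Python. It assumes a toy three-document corpus and two hypothetical tokenizers (`tokenize` and `normalized_tokenize`), none of which come from the text. It builds a bag-of-words vector whose length equals the number of distinct terms seen, and shows that case variants, punctuation, and slang each add new dimensions under naive extraction.

```python
import re
from collections import Counter

# Toy corpus (an assumption for illustration only).
docs = [
    "Spark is fast. Spark is general.",
    "spark runs on Hadoop, Mesos, or standalone!",
    "LOL spark is gr8 :)",
]

def tokenize(text):
    # Naive whitespace split: "Spark", "spark" and "fast." are all distinct terms.
    return text.split()

def normalized_tokenize(text):
    # Lowercase and keep only runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def vocabulary(documents, tokenizer):
    # Map each distinct term to the next free index, in order of first appearance.
    vocab = {}
    for doc in documents:
        for tok in tokenizer(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def bag_of_words(doc, vocab, tokenizer):
    # One dimension per vocabulary term; the value is the term count in this document.
    vec = [0] * len(vocab)
    for tok, count in Counter(tokenizer(doc)).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec

raw_vocab = vocabulary(docs, tokenize)
norm_vocab = vocabulary(docs, normalized_tokenize)

print("naive dimensions:     ", len(raw_vocab))
print("normalized dimensions:", len(norm_vocab))
print(bag_of_words(docs[0], norm_vocab, normalized_tokenize))
```

Even in this tiny corpus the vector length is tied to the vocabulary, so every new spelling, symbol, or language adds dimensions. Normalization shrinks the space only slightly, which is why dedicated techniques for handling dimensionality are needed.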
To deal with these issues, we need ways of extracting more structured features and methods
to handle the huge dimensionality of text data.