Table 9.1 Example Corpora in Natural Language Processing

Corpus                                           Word Count    Domain              Website
Shakespeare                                      0.88 million  Written             http://shakespeare.mit.edu/
Brown Corpus                                     1 million     Written             http://icame.uib.no/brown/
Penn Treebank                                    1 million     Newswire            http://www.cis.upenn.edu/
Switchboard Phone Conversations                  3 million     Spoken
British National Corpus                          100 million   Written and spoken
NA News Corpus                                   350 million   Newswire            http://catalog.ldc.upenn.edu/
European Parliament Proceedings Parallel Corpus  600 million   Legal
Google N-Grams Corpus                            1 trillion

The smallest corpus in the list, the complete works of Shakespeare, contains about
0.88 million words. In contrast, the Google n-gram corpus contains one trillion
words from publicly accessible web pages. Out of the one trillion words in the
Google n-gram corpus, there might be one million distinct words, which would
correspond to one million dimensions. The high dimensionality of text is an
important issue, and it has a direct impact on the complexities of many text
analysis tasks.
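
The link between distinct words and dimensions can be illustrated with a minimal sketch. It assumes a simple bag-of-words representation with lowercase, whitespace-based tokenization; the representation, the function names, and the two-sentence toy corpus are illustrative assumptions rather than anything prescribed in the text. Each distinct word is assigned one position in the vector, so the vector length equals the vocabulary size.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct lowercase token its own dimension index."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_vector(document, vocab):
    """Return a count vector with one entry per distinct word in vocab."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


# Hypothetical toy corpus used only for illustration.
documents = [
    "to be or not to be",
    "all the world is a stage",
]

vocab = build_vocabulary(documents)
vector = to_vector(documents[0], vocab)

print(len(vocab))  # number of dimensions = number of distinct words
print(vector)      # mostly zeros once the vocabulary grows
```

With only two short sentences the vocabulary already has ten entries, and most positions in any single document's vector are zero. Under the same assumptions, a vocabulary of one million distinct words would give every document a one-million-entry vector, which is why sparse storage is usually preferred in practice.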
Another major challenge with text analysis is that most of the time the text is not
structured. As introduced in Chapter 1, “Introduction to Big Data Analytics,” this
may include quasi-structured, semi-structured, or unstructured data.
Table 9.2 shows some example data sources and data formats that text analysis may have to