Table 9.1 Example Corpora in Natural Language Processing
Corpus                                           Word Count    Domain              Website
Shakespeare                                      0.88 million  Written             http://shakespeare.mit.edu/
Brown Corpus                                     1 million     Written             http://icame.uib.no/brown/bcm.html
Penn Treebank                                    1 million     Newswire            http://www.cis.upenn.edu/~treebank/
Switchboard Phone Conversations                  3 million     Spoken              http://catalog.ldc.upenn.edu/LDC97S62
British National Corpus                          100 million   Written and spoken  http://www.natcorp.ox.ac.uk/
NA News Corpus                                   350 million   Newswire            http://catalog.ldc.upenn.edu/LDC95T21
European Parliament Proceedings Parallel Corpus  600 million   Legal               http://www.statmt.org/europarl/
Google N-Grams Corpus                            1 trillion    Written             http://catalog.ldc.upenn.edu/LDC2006T13
The smallest corpus in the list, the complete works of Shakespeare, contains about
0.88 million words. In contrast, the Google n-gram corpus contains one trillion
words from publicly accessible web pages. Out of the one trillion words in the
Google n-gram corpus, there might be one million distinct words. If each distinct
word is treated as a separate feature, this would correspond to one million
dimensions. The high dimensionality of text is an important issue, and it has a
direct impact on the complexity of many text analysis tasks.
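The link between distinct words and dimensions can be illustrated with a minimal sketch. It assumes a simple bag-of-words representation, in which each document becomes a vector with one entry per distinct word. The two-sentence toy corpus, the regular-expression tokenizer, and the helper names tokenize and to_vector are illustrative assumptions, not part of the original text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


# Hypothetical toy corpus; a real corpus would be far larger.
documents = [
    "To be, or not to be, that is the question.",
    "All the world's a stage, and all the men and women merely players.",
]

tokenized = [tokenize(doc) for doc in documents]

# The vocabulary is the set of distinct words across all documents.
# Its size is the number of dimensions in the bag-of-words representation.
vocabulary = sorted({token for doc in tokenized for token in doc})
index = {word: i for i, word in enumerate(vocabulary)}


def to_vector(tokens):
    """Map a token list to a count vector with one entry per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokens).items():
        vector[index[word]] = count
    return vector


vectors = [to_vector(doc) for doc in tokenized]
total_words = sum(len(doc) for doc in tokenized)

print("total words:   ", total_words)
print("distinct words:", len(vocabulary))
print("vector length: ", len(vectors[0]))
```

Even in this tiny example most entries of each vector are zero, because each document uses only part of the vocabulary. With a vocabulary of around one million distinct words the same effect is far more pronounced, which is one reason high dimensionality complicates text analysis.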
Another major challenge with text analysis is that most of the time the text is not
structured. As introduced in Chapter 1, “Introduction to Big Data Analytics,” this
may include quasi-structured, semi-structured, or unstructured data. Table 9.2
shows some example data sources and data formats that text analysis may have to