11314
Next, we will take a look at the newsgroup topics available:
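// Each key is a file path; its parent directory name is the newsgroup topic
val newsgroups = rdd.map { case (file, text) =>
  file.split("/").takeRight(2).head
}
// Count the messages per newsgroup and sort by descending count
val countByGroup = newsgroups.map(n => (n, 1)).reduceByKey(_ + _)
  .collect.sortBy(-_._2).mkString("\n")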
println(countByGroup)
This will display the following result:
(rec.sport.hockey,600)
(soc.religion.christian,599)
(rec.motorcycles,598)
(rec.sport.baseball,597)
(sci.crypt,595)
(rec.autos,594)
(sci.med,594)
(comp.windows.x,593)
(sci.space,593)
(sci.electronics,591)
(comp.os.ms-windows.misc,591)
(comp.sys.ibm.pc.hardware,590)
(misc.forsale,585)
(comp.graphics,584)
(comp.sys.mac.hardware,578)
(talk.politics.mideast,564)
(talk.politics.guns,546)
(alt.atheism,480)
(talk.politics.misc,465)
(talk.religion.misc,377)
We can see that the messages are spread fairly evenly across the topics: most groups hold between about 550 and 600 messages, and the smallest, talk.religion.misc, has 377.
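To make the path handling in the counting code easier to follow, here is a minimal sketch that runs the same extraction and counting logic on a plain Scala collection, without Spark. The file paths and the helper name groupOf are hypothetical and exist only for illustration; the sketch assumes each path ends in a newsgroup directory followed by a message file name. It can be pasted into the Scala REPL or the Spark shell.

```scala
// Hypothetical paths in the assumed layout <root>/<newsgroup>/<messageFile>.
val paths = Seq(
  "/data/train/sci.space/60001",
  "/data/train/sci.space/60002",
  "/data/train/rec.autos/10001"
)

// takeRight(2) keeps the last two path segments (newsgroup, file name);
// head then selects the newsgroup directory.
def groupOf(path: String): String = path.split("/").takeRight(2).head

// Local stand-in for map(n => (n, 1)).reduceByKey(_ + _),
// followed by the same sort on descending count.
val counts = paths
  .map(groupOf)
  .groupBy(identity)
  .map { case (group, members) => (group, members.size) }
  .toSeq
  .sortBy(-_._2)

println(counts.mkString("\n"))
// prints:
// (sci.space,2)
// (rec.autos,1)
```

On an RDD, reduceByKey performs the per-key summation in a distributed fashion; the groupBy call here is only a local substitute for small in-memory data.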
Applying basic tokenization
The first step in our text processing pipeline is to split up the raw text content in each document
into a collection of terms (also referred to as tokens). This is known as tokeniza-