archive-name:,atheism/resources
alt-atheism-archive-name:,december,,,,,,,,,,,,,,,,,,,,,,addresses,addresses,,,,,,,religion,to:,to:,,p.o.,53701.
telephone:,sell,the,,fish,on,their,cars,,with,and,written
inside.,3d,plastic,plastic,,evolution,evolution,7119,,,,,san,san,san,mailing,net,who,to,atheist,press
aap,various,bible,,and,on.,,,one,book,is:
"the,w.p.,american,pp.,,1986.,bible,contains,ball,,based,based,james,of
Improving our tokenization
The preceding simple approach results in a lot of tokens and does not filter out many nonword characters (such as punctuation). Most tokenization schemes will remove these characters. We can do this by splitting each raw document on nonword characters using a regular expression pattern:
val nonWordSplit = text.flatMap(t =>
  t.split("""\W+""").map(_.toLowerCase))
println(nonWordSplit.distinct.count)
This reduces the number of unique tokens significantly:
130126
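To see why this removes punctuation, here is a minimal sketch (the input string is an invented example, not taken from the dataset). The \W+ pattern matches runs of one or more characters outside [a-zA-Z0-9_], so punctuation and whitespace are both treated as token delimiters:
val example = "FYI: it's true!"
// \W+ matches runs of nonword characters ([^a-zA-Z0-9_]),
// so ": ", "'", " ", and "!" all act as delimiters
val tokens = example.split("""\W+""").map(_.toLowerCase)
println(tokens.mkString(","))
// prints: fyi,it,s,true
Note that the apostrophe in "it's" is also a delimiter, so contractions are split into separate tokens; this is a known side effect of the simple \W+ pattern.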
If we inspect the first few tokens, we will see that we have eliminated most of the less
useful characters in the text:
println(nonWordSplit.distinct.sample(true, 0.3, 42).take(100).mkString(","))
You will see the following result displayed:
bone,k29p,w1w3s1,odwyer,dnj33n,bruns,_congressional,mmejv5,mmejv5,artur,125215,entitlements,beleive,1pqd9hinnbmi,
jxicaijp,b0vp,underscored,believiing,qsins,1472,urtfi,nauseam,tohc4,kielbasa,ao,wargame,seetex,museum,typeset,pgva4,
dcbq,ja_jp,ww4ewa4g,animating,animating,10011100b,10011100b,413,wp3d,wp3d,cannibal,searflame,ets,1qjfnv,6jx,6jx,
detergent,yan,aanp,unaskable,9mf,bowdoin,chov,16mb,createwindow,kjznkh,df,classifieds,hour,cfsmo,santiago,santiago,
1r1d62,almanac_,almanac_,chq,nowadays,formac,formac,bacteriophage,barking,barking,barking,ipmgocj7b,monger,projector,
hama,65e90h8y,homewriter,cl5,1496,zysec,homerific,00ecgillespie,00ecgillespie,mqh0,suspects,steve_mullins,io21087,
funded,liberated,canonical,throng,0hnz,exxon,xtappcontext,mcdcup,mcdcup,5seg,biscuits