While our nonword pattern to split text works fairly well, we are still left with numbers
and tokens that contain numeric characters. In some cases, numbers can be an important
part of a corpus. For our purposes, the next step in our pipeline will be to filter out
numbers and tokens that are words mixed with numbers.
We can do this by applying another regular expression pattern, one that matches only
tokens containing no digits, and filtering out the tokens that do not match it:
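// Matches only tokens made up entirely of non-digit characters
val regex = """[^0-9]*""".r
// Keep a token only if the whole token matches, that is, it contains no digits
val filterNumbers = nonWordSplit.filter(token =>
  regex.pattern.matcher(token).matches)
println(filterNumbers.distinct.count)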
This further reduces the size of the token set:
84912
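To see what the pattern keeps and discards, here is a minimal sketch that applies the same regular expression to an ordinary Scala collection instead of an RDD. The names sampleTokens and digitFree and the token values are hypothetical, chosen only for illustration. It can be pasted into the Scala REPL or spark-shell.

```scala
// Hypothetical sample tokens; in the text the tokens come from the nonWordSplit RDD.
val sampleTokens = Seq("hello", "2nd", "1993", "world", "x11", "za_")

// Same pattern as in the text: zero or more characters, none of which is a digit.
val digitFree = """[^0-9]*""".r

// matches requires the whole token to match the pattern,
// so a single digit anywhere in the token causes it to be rejected.
val kept = sampleTokens.filter(token => digitFree.pattern.matcher(token).matches)

println(kept.mkString(","))  // prints hello,world,za_
```

Note that pure numbers (1993) and mixed tokens (2nd, x11) are both dropped, while tokens with other non-letter characters such as the underscore in za_ are kept.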
Let's take a look at another random sample of the filtered tokens:
println(filterNumbers.distinct.sample(true, 0.3, 42)
  .take(100).mkString(","))
You will see output similar to the following:
reunion,wuair,schwabam,eer,silikian,fuller,sloppiness,crying,crying,beckmans,leymarie,fowl,husky,rlhzrlhz,ignore,
loyalists,goofed,arius,isgal,dfuller,neurologists,robin,jxicaijp,majorly,nondiscriminatory,akl,sively,adultery,
urtfi,kielbasa,ao,instantaneous,subscriptions,collins,collins,za_,za_,jmckinney,nonmeasurable,nonmeasurable,
seetex,kjvar,dcbq,randall_clark,theoreticians,theoreticians,congresswoman,sparcstaton,diccon,nonnemacher,
arresed,ets,sganet,internship,bombay,keysym,newsserver,connecters,igpp,aichi,impute,impute,raffle,nixdorf,
nixdorf,amazement,butterfield,geosync,geosync,scoliosis,eng,eng,eng,kjznkh,explorers,antisemites,bombardments,
abba,caramate,tully,mishandles,wgtn,springer,nkm,nkm,alchoholic,chq,shutdown,bruncati,nowadays,mtearle,eastre,
discernible,bacteriophage,paradijs,systematically,rluap,rluap,blown,moderates
We can see that we have removed all the tokens that contain numeric characters. This still
leaves us with a few strange words, but we will not worry about these too much here.
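The sample above contains repeated tokens (for example, crying,crying) even though distinct was applied first. This is because the first argument to sample is true, which requests sampling with replacement, so the same element can be drawn more than once. The following sketch illustrates the difference using plain Scala collections and scala.util.Random. It is not how Spark implements sample, and the names distinctTokens, rng, withReplacement, and withoutReplacement are hypothetical.

```scala
import scala.util.Random

// Hypothetical stand-in for a small set of distinct tokens.
val distinctTokens = Vector("reunion", "fuller", "crying", "collins", "impute")
val rng = new Random(42L)

// With replacement: every draw picks independently from the full set,
// so 20 draws from 5 distinct tokens must contain repeats.
val withReplacement = Vector.fill(20)(distinctTokens(rng.nextInt(distinctTokens.size)))

// Without replacement: shuffle once and take a prefix, so no token can repeat.
val withoutReplacement = rng.shuffle(distinctTokens).take(3)

println(withReplacement.mkString(","))
println(withoutReplacement.mkString(","))
```

In Spark, passing false as the first argument to sample requests sampling without replacement, which would avoid such duplicates in a sample drawn from distinct tokens.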
Removing stop words
Stop words refer to common words that occur many times across almost all documents in
a corpus (and across most corpora). Examples of typical English stop words include and,