How it works…
This recipe processes a document through a number of stages to get the results (a code sketch of this pipeline follows the explanation):
1. We'll read in the file and pull out the words.
2. We'll partition the tokens into chunks of 500 tokens, each overlapping by 250 tokens. This allows us to deal with localized parts of the document. Each partition is large enough to be interesting, but small enough to be narrowly focused.
3. For each window, we'll get the frequency of the term baker. This data is kind of spiky. This is fine for some applications, but we may want to smooth it out to make the data less noisy and to show the trends better.
4. So, we'll break the sequence of frequencies of baker into a rolling set of ten windows. Each set is offset from the previous set by one.
5. We'll then get the average frequency for each set of frequencies. This removes much of the variability and spikiness from the raw data, but it maintains the general shape of the data. We can still see the spike around 220 in the preceding screenshot.
By the way, that spike is from the short story, The Adventure of the Blue Carbuncle. A character in that story is Henry Baker, so the spike is not just from references to Baker Street, but also to the character.
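To make those stages concrete, here is a minimal sketch of the pipeline in Python. This is not the recipe's own code; the function names, the regular-expression tokenizer, and the placeholder file name are all assumptions made for illustration.

import re

def tokenize(text):
    # Pull lowercase word tokens out of the raw text (assumed tokenizer).
    return re.findall(r"[a-z']+", text.lower())

def windows(tokens, size=500, step=250):
    # Overlapping chunks: each 500-token window starts 250 tokens after
    # the previous one, so consecutive windows overlap by 250 tokens.
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, step)]

def term_frequency(window, term="baker"):
    # Raw count of the term within a single window.
    return sum(1 for token in window if token == term)

def rolling_mean(values, n=10):
    # Average each run of ten consecutive values, each run offset from the
    # previous by one, to smooth the spiky raw frequencies while keeping
    # their overall shape.
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]

# Usage (the file name is only a placeholder):
# with open("sherlock-holmes.txt") as f:
#     tokens = tokenize(f.read())
# raw = [term_frequency(w) for w in windows(tokens)]
# smoothed = rolling_mean(raw)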
Validating sample statistics with bootstrapping
When working with sampled data, we need to produce descriptive statistics. We also want to know how accurate our estimates are, which is captured by the standard error of the estimate. Bootstrapping is a way to estimate the standard error of an estimate when we can't observe the underlying population directly. Bootstrapping works by repeatedly resampling the original sample with replacement, so an item can appear in a resample more than once. Doing this over and over allows us to estimate the standard error.
We can use bootstrapping when the sample we're working with is small, or when we don't
know the distribution of the sample's population.
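To illustrate the idea (this is not the recipe's own code), here is a minimal bootstrap of the standard error of the mean in Python; the function name, the default of 1,000 resamples, and the fixed seed are assumptions for the sketch.

import random
import statistics

def bootstrap_standard_error(sample, statistic=statistics.mean,
                             n_resamples=1000, seed=42):
    # Repeatedly resample the original sample with replacement, so an
    # item can appear in a resample more than once. Compute the statistic
    # on each resample; the standard deviation of those values estimates
    # the standard error of the statistic.
    rng = random.Random(seed)
    stats = [statistic([rng.choice(sample) for _ in sample])
             for _ in range(n_resamples)]
    return statistics.stdev(stats)

# Usage: standard error of the mean for a small sample.
# data = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5]
# print(bootstrap_standard_error(data))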
 