How will you go about fixing this? It might be easier to have separate versions of this
function for integers and floats. In the end, you need to know your data in order to decide
how best to handle it.
Calculating relative values
One way to normalize values is to scale frequencies by the sizes of their groups. For example,
say the word truth appears three times in a document. This means one thing if the document
has 30 words and something else entirely if it has 300 or 3,000 words. Moreover, if the
dataset has documents of all these lengths, how do you compare the frequencies of words
across documents?
One way to do this is to rescale the frequency counts. In some cases, we can simply scale the
term frequencies by the lengths of their documents. Or, if we want better results, we might use
something more sophisticated such as term frequency-inverse document frequency (TF-IDF).
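TF-IDF itself is beyond the scope of this recipe, but as a rough illustration, it multiplies a term's frequency within one document by a factor that down-weights terms appearing in many documents. The following is a minimal sketch (not the recipe's code); `corpus` is assumed to be a collection of token sequences, and the `inc` in the denominator is a common smoothing choice to avoid division by zero:

```clojure
;; tf: how often term occurs in one document, relative to its length.
(defn tf [term doc]
  (/ (count (filter #{term} doc)) (count doc)))

;; idf: log of (document count / documents containing term),
;; with inc as a simple smoothing term (an assumption, not standard-mandated).
(defn idf [term corpus]
  (Math/log (/ (count corpus)
               (inc (count (filter #(some #{term} %) corpus))))))

(defn tf-idf [term doc corpus]
  (* (tf term doc) (idf term corpus)))
```

A rare term in a short document scores high; a term that appears in every document is pushed toward (or below) zero by the logarithm.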
For this recipe, we'll rescale some term frequencies by the total word count for their document.
Getting ready
We don't need much for this recipe. We'll use the minimal project.clj file, which is
listed here:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
However, it will be easier if we have a 'pretty-printer' available in the REPL:
(require '[clojure.pprint :as pp])
How to do it…
Actually, let's frame this problem in a more abstract manner. If each datum is a map, we can
rescale one key ( :frequency ) by the total of this key's values in the group defined by another
key ( :document ). This is a more general approach and should be useful in more situations.
1. Let's define a function that rescales by a key's total in a collection. It assigns the
scaled value to a new key ( dest ):
(defn rescale-by-total [src dest coll]
  (let [total (reduce + (map src coll))
        update #(assoc % dest (/ (% src) total))]
    (map update coll)))
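To see what this does, we can try it on a small, made-up collection in the REPL (the sample data here is hypothetical, and :rel-freq is just an illustrative destination key):

```clojure
;; Hypothetical word counts; the total across the collection is 30.
(def word-counts
  [{:word "truth" :frequency 3}
   {:word "the"   :frequency 17}
   {:word "is"    :frequency 10}])

(pp/pprint (rescale-by-total :frequency :rel-freq word-counts))
```

Since the total is 30, truth's :rel-freq comes out as the ratio 3/30, which Clojure reduces to 1/10. Note that the input maps are left untouched; each output map simply gains the new dest key.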
 