The maximum distance can also be a percentage of the length of the shortest input string.
In this case, we're using 10 percent of that length as the maximum allowed difference.
If either of these two conditions is met, the strings are judged to be the same. Each
condition covers a different scenario. With the absolute limit, no matter the length of the
strings, if no more than two characters change, they count as the same. This is problematic
for very short strings.
On the other hand, a hard maximum distance doesn't work for very long strings either.
If, for example, the value is 200 characters or more, you'll want to allow more absolute
characters of difference than you would for a string of 20 characters. fuzzy-percent-diff
provides this flexibility.
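The two conditions can be sketched as follows. This is only an illustration under stated assumptions: the values 2 and 0.1 come from the discussion above, the name fuzzy= is chosen here for the example, and indel-distance is a hypothetical plain-Clojure helper that counts single-character inserts and deletes, standing in for whatever distance function the recipe actually uses.

```clojure
;; Hypothetical helper: length of the longest common subsequence,
;; computed one row of the dynamic-programming table at a time.
(defn lcs-length [a b]
  (peek
    (reduce
      (fn [prev ca]
        (reduce
          (fn [row [j cb]]
            (conj row
                  (if (= ca cb)
                    (inc (nth prev j))
                    (max (nth prev (inc j)) (peek row)))))
          [0]
          (map-indexed vector b)))
      (vec (repeat (inc (count b)) 0))
      a)))

;; Hypothetical stand-in for an insert/delete-only edit distance.
(defn indel-distance [a b]
  (- (+ (count a) (count b))
     (* 2 (lcs-length a b))))

(def fuzzy-max-diff 2)        ; absolute limit, in characters
(def fuzzy-percent-diff 0.1)  ; relative limit: 10 percent of the shorter string

(defn fuzzy=
  "True if a and b are within either the absolute or the relative limit."
  [a b]
  (let [dist     (indel-distance a b)
        shortest (min (count a) (count b))]
    (or (<= dist fuzzy-max-diff)
        ;; guard against dividing by zero for empty strings
        (and (pos? shortest)
             (<= (/ dist shortest) fuzzy-percent-diff)))))

(fuzzy= "ace" "are")           ;=> true, the distance is only 2
(fuzzy= "abcdefghij" "abcdef") ;=> false, 4 changes fail both limits
```

For a 50-character string with four characters removed, the absolute limit fails but the relative one passes (4/46 is under 10 percent), which is the case the percentage is meant to cover.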
There's more...
As I mentioned, this will not handle short strings very well. For example, it will judge ace and
are to be the same. We can refine the logic by adding a clause that applies fuzzy-max-diff
only if the strings are longer than some minimum length.
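One way such a clause might look is sketched below. The threshold fuzzy-min-length, its value of 10, and the function name close-enough? are all hypothetical and not part of the recipe. To keep the sketch independent of any particular distance function, the already computed distance is passed in as an argument.

```clojure
(def fuzzy-max-diff 2)
(def fuzzy-percent-diff 0.1)
(def fuzzy-min-length 10)  ; hypothetical threshold, tune for your data

(defn close-enough?
  "Decides whether two strings match, given their edit distance dist.
  The absolute limit applies only when the shorter string is longer
  than fuzzy-min-length; the relative limit always applies."
  [dist a b]
  (let [shortest (min (count a) (count b))]
    (or (and (> shortest fuzzy-min-length)
             (<= dist fuzzy-max-diff))
        (and (pos? shortest)
             (<= (/ dist shortest) fuzzy-percent-diff)))))

(close-enough? 2 "ace" "are")                     ;=> false
(close-enough? 2 "fuzzy matching" "fuzy matchng") ;=> true
```

With the guard in place, ace and are no longer match, while longer strings that differ by two characters still do.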
In this recipe, we used clj-diff.core/edit-distance. This measures the number
of changes that need to be made in order to transform one string into the other with the
single-character operations insert and delete. Another option is to use
clj-diff.core/levenshtein-distance, which also uses a single-character replace operation.
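To see what the replace operation changes, here is a minimal plain-Clojure sketch of the textbook Levenshtein distance. It is not the clj-diff implementation, and the function name levenshtein is chosen for this example only.

```clojure
(defn levenshtein
  "Minimum number of single-character inserts, deletes, and replaces
  needed to turn string a into string b."
  [a b]
  (peek
    (reduce
      (fn [prev [i ca]]
        (reduce
          (fn [row [j cb]]
            (conj row
                  (min (inc (nth prev (inc j)))                 ; delete ca
                       (inc (peek row))                         ; insert cb
                       (+ (nth prev j) (if (= ca cb) 0 1)))))   ; match or replace
          [(inc i)]
          (map-indexed vector b)))
      (vec (range (inc (count b))))
      (map-indexed vector a))))

(levenshtein "ace" "are")        ;=> 1
(levenshtein "kitten" "sitting") ;=> 3
```

With only insert and delete available, turning ace into are takes two steps (delete the c, insert an r). With replace it takes one, so the same threshold is effectively stricter under an insert/delete-only distance.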
Regularizing numbers
If we need to read in numbers as strings, we have to worry about how they're formatted.
However, we'll probably want the computer to deal with them as numbers, not as strings, and
this can't happen if the string contains a comma or period to separate the thousands place.
Converting the strings allows the numbers to be sorted and to be available for mathematical
functions.
In this recipe, we'll write a short function that takes a number string and returns the number.
The function will strip out all of the extra punctuation inside the number and only leave the
last separator. Hopefully, this will be the one that marks the decimal place.
Of course, the version of this function that we'll see here only works in locales that use
commas to separate thousands and periods to separate decimals. However, it would be
relatively easy to write versions that will work in any particular locale.
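As a rough preview of the idea, the approach described above (drop every separator except the last and treat that one as the decimal point) might be sketched like this. The name normalize-number and the details are assumptions made for illustration, and the recipe's own implementation may differ.

```clojure
(require '[clojure.string :as string])

(defn normalize-number
  "Parses a number string, treating the last comma or period as the
  decimal point and discarding all earlier separators."
  [s]
  (let [parts (string/split s #"[,.]")]
    (if (= 1 (count parts))
      ;; no separator at all: parse the digits directly
      (Double/parseDouble (first parts))
      (Double/parseDouble
        (str (apply str (butlast parts)) "." (last parts))))))

(normalize-number "1,000.00") ;=> 1000.0
(normalize-number "3.1415")   ;=> 3.1415
(normalize-number "1,000")    ;=> 1.0, the last separator is assumed to be the decimal point
```

The last call shows why the last separator only hopefully marks the decimal place: a thousands separator with no decimal part is misread.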
Getting ready
For this recipe, we're back to the simplest of project.clj files:
(defproject cleaning-data "0.1.0-SNAPSHOT"
:dependencies [[org.clojure/clojure "1.6.0"]])
 