2
Cleaning and Validating Data
In this chapter, we will cover the following recipes:
• Cleaning data with regular expressions
• Maintaining consistency with synonym maps
• Identifying and removing duplicate data
• Regularizing numbers
• Calculating relative values
• Parsing dates and times
• Lazily processing very large data sets
• Sampling from very large data sets
• Fixing spelling errors
• Parsing custom data formats
• Validating data with Valip
Introduction
You will probably spend less time getting the data than you will getting it into shape.
Raw data is often inconsistent, duplicated, or full of holes. Addresses might be missing, years
and dates might be formatted in a thousand different ways, or names might be entered into
the wrong fields. You'll have to fix these issues before the data is usable.