Table 7-1. Python versus SQL data types

Type              Python             SQL
Character string  unicode            VARCHAR
Boolean           bool               BOOLEAN
Integer           int                INTEGER
Real number       float              FLOAT
Date              datetime.date      DATE
Time              datetime.time      TIME
Date and time     datetime.datetime  DATETIME
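As a rough illustration of Table 7-1, the mapping can be written down as a Python dictionary. This is only a sketch: the helper name sql_type is hypothetical, and it assumes Python 3, where str takes the place of the unicode type listed in the table.

```python
import datetime

# Mapping copied from Table 7-1. Assumption: on Python 3, str stands in
# for the unicode type named in the table.
PYTHON_TO_SQL = {
    str: "VARCHAR",
    bool: "BOOLEAN",
    int: "INTEGER",
    float: "FLOAT",
    datetime.date: "DATE",
    datetime.time: "TIME",
    datetime.datetime: "DATETIME",
}


def sql_type(value):
    """Return the SQL type name from Table 7-1 for a Python value.

    Hypothetical helper for illustration only.
    """
    # Look up the exact type instead of using isinstance, because bool is
    # a subclass of int and datetime.datetime is a subclass of datetime.date.
    return PYTHON_TO_SQL[type(value)]
```

Looking up the exact type matters here: with isinstance checks, True would also match int and a datetime.datetime value would also match datetime.date.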
Unique Identifiers, Continuous Variables, and Factors
Knowing the data type of each feature is not enough. It's also essential to know what
each feature represents. Domain knowledge is very useful here, but we may also get
some ideas from the data itself.
Both a string and an integer could be a unique identifier or could represent a
category. In the latter case, the feature could be used to assign a color in your
visualization. If an integer denotes, say, a ZIP code, then it doesn't make sense to
compute the average.
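The ZIP code point can be made concrete with a small Python sketch. The ZIP codes below are made up for illustration; the code only shows that an average can be computed without meaning anything, whereas counting per category is a sensible summary.

```python
from collections import Counter
from statistics import mean

# Made-up ZIP codes stored as integers, purely for illustration.
zip_codes = [10001, 94105, 10001, 60614]

# The mean is computable, but the result is not a ZIP code that appears
# in the data and says nothing useful about location.
average_zip = mean(zip_codes)

# Treating the values as categories and counting them is meaningful.
zip_counts = Counter(zip_codes)
```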
To determine whether a feature should be treated as a unique identifier or categorical
variable (or factor in R terminology), you could count the number of unique values
for a specific column:
$ cat data/iris.csv | csvcut -c species | body "sort | uniq | wc -l"
species
3
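The same count can be sketched in Python using only the standard library. The inline sample below is a stand-in for data/iris.csv with illustrative values, and count_unique is a hypothetical helper name, not part of any tool mentioned here.

```python
import csv
import io

# Small inline sample standing in for data/iris.csv (illustrative values).
sample = io.StringIO(
    "sepal_length,species\n"
    "5.1,setosa\n"
    "7.0,versicolor\n"
    "6.3,virginica\n"
    "4.9,setosa\n"
)


def count_unique(lines, column):
    """Count the distinct values in one column of a CSV with a header row."""
    reader = csv.DictReader(lines)
    # A set keeps each value once, so its size is the number of unique values.
    return len({row[column] for row in reader})


n_species = count_unique(sample, "species")
```

For a real file, an open file object (for example from open("data/iris.csv", newline="")) could be passed in place of the inline sample.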
Or we can use csvstat (Groskopf, 2014), which is part of csvkit, to get the number
of unique values for each column:
$ csvstat data/investments2.csv --unique
1. company_permalink: 27342
2. company_name: 27324
3. company_category_list: 8759
4. company_market: 443
5. company_country_code: 150
6. company_state_code: 147
7. company_region: 1079
8. company_city: 3305
9. investor_permalink: 11176
10. investor_name: 11135
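One way to act on such counts is to compare the number of unique values with the number of rows. The sketch below is an assumption-laden illustration, not what csvstat does: unique_counts and guess_role are hypothetical helpers, the 0.9 threshold is arbitrary, and the sample rows are made up (only the column names come from the output above).

```python
import csv
import io


def unique_counts(lines):
    """Return (row_count, {column: number of distinct values})."""
    reader = csv.DictReader(lines)
    seen = {name: set() for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in reader.fieldnames:
            seen[name].add(row[name])
    return n_rows, {name: len(values) for name, values in seen.items()}


def guess_role(n_unique, n_rows, threshold=0.9):
    """Hypothetical heuristic: columns whose values are almost all
    distinct look like identifiers, the rest like factors."""
    if n_rows and n_unique / n_rows >= threshold:
        return "identifier"
    return "factor"


# Made-up rows; only the column names are taken from the csvstat output.
sample = io.StringIO(
    "company_permalink,company_country_code\n"
    "/company/a,USA\n"
    "/company/b,USA\n"
    "/company/c,GBR\n"
    "/company/d,USA\n"
)
n_rows, counts = unique_counts(sample)
roles = {name: guess_role(n, n_rows) for name, n in counts.items()}
```

This heuristic is crude: a continuous numeric column would also have mostly distinct values, so the result still needs to be checked against what the feature actually represents.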
 