Exploring Data - Data Science at the Command Line


Table 7-1. Python versus SQL data types

Type               Python               SQL
Character string   unicode              VARCHAR
Boolean            bool                 BOOLEAN
Integer            int                  INTEGER
Real number        float                FLOAT
Date               datetime.date        DATE
Time               datetime.time        TIME
Date and time      datetime.datetime    DATETIME

Unique Identifiers, Continuous Variables, and Factors

Knowing the data type of each feature is not enough. It's also essential to know what
each feature represents. Domain knowledge is very useful here, but we may also get
some ideas from the data itself.

Both a string and an integer could be a unique identifier or could represent a category.
In the latter case, the feature could be used to assign colors in your visualization. If an
integer denotes, say, a ZIP code, then it doesn't make sense to compute its average.
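
For a code-like integer such as a ZIP code, counting how often each value occurs is usually more informative than a mean. The following is a minimal sketch using only awk and sort. The function name value_counts and the sample data are hypothetical, and the function assumes a simple CSV file with one header line and no quoted fields containing commas.

```shell
# value_counts FILE COLUMN_NUMBER
# Prints "value,count" for every distinct value in one column, sorted by value.
# Assumes one header line and no quoted fields containing commas.
value_counts() {
  awk -F, -v col="$2" '
    NR > 1 { n[$col]++ }                        # skip the header, tally each value
    END    { for (v in n) print v "," n[v] }
  ' "$1" | sort
}

# Small made-up sample, written to a temporary file for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
name,zip
alice,10001
bob,94105
carol,10001
EOF

value_counts "$sample" 2
```

On this sample, the function reports that 10001 occurs twice and 94105 once, which says more about the column than an average of the two codes would.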

To determine whether a feature should be treated as a unique identifier or as a categorical
variable (or factor, in R terminology), you could count the number of unique values
for a specific column:

$ cat data/iris.csv | csvcut -c species | body "sort | uniq | wc -l"
species
3
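
The body helper used above is not a standard utility and may not be installed on your system. A rough equivalent can be sketched with standard tools only. The function name count_unique and the sample file are hypothetical, and the sketch assumes a simple CSV file with one header line and no quoted fields containing commas.

```shell
# count_unique FILE COLUMN_NUMBER
# Prints the number of distinct values in one column of a simple CSV file.
# Assumes one header line and no quoted fields containing commas.
count_unique() {
  # Drop the header, keep one column, deduplicate, count the remaining lines.
  # tr strips the padding that some wc implementations add.
  tail -n +2 "$1" | cut -d, -f"$2" | sort -u | wc -l | tr -d ' '
}

# Small made-up sample, written to a temporary file for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
id,species
1,setosa
2,setosa
3,versicolor
4,virginica
EOF

count_unique "$sample" 2   # distinct species values
count_unique "$sample" 1   # distinct id values
```

Unlike the csvcut pipeline, this version selects the column by position rather than by name and does not echo the header.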

Or we can use csvstat (Groskopf, 2014), which is part of Csvkit, to get the number
of unique values for each column:

$ csvstat data/investments2.csv --unique
  1. company_permalink: 27342
  2. company_name: 27324
  3. company_category_list: 8759
  4. company_market: 443
  5. company_country_code: 150
  6. company_state_code: 147
  7. company_region: 1079
  8. company_city: 3305
  9. investor_permalink: 11176
 10. investor_name: 11135
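
Unique counts are easier to interpret next to the number of rows. The sketch below prints both for every column and attaches a rough guess at the column's role. The rule it uses (all values distinct suggests an identifier, repeated values suggest a factor) is an assumption made for illustration, not something csvstat does, and it will mislabel continuous numeric columns that happen to contain no repeats. The function name column_roles and the sample data are hypothetical, and the input is assumed to be a simple CSV file with one header line and no quoted fields containing commas.

```shell
# column_roles FILE
# For each column, print: name, distinct values, data rows, and a rough guess.
# Assumes one header line and no quoted fields containing commas.
column_roles() {
  awk -F, '
    NR == 1 { ncol = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    {
      rows++
      for (i = 1; i <= ncol; i++)
        if (!((i, $i) in seen)) { seen[i, $i] = 1; uniq[i]++ }
    }
    END {
      for (i = 1; i <= ncol; i++) {
        # Heuristic (an assumption): no repeated values hints at an identifier.
        role = (uniq[i] == rows) ? "identifier?" : "categorical?"
        printf "%s: %d of %d (%s)\n", name[i], uniq[i], rows, role
      }
    }
  ' "$1"
}

# Small made-up sample, written to a temporary file for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
id,species,region
1,setosa,north
2,setosa,south
3,virginica,north
EOF

column_roles "$sample"
```

The guesses are only a starting point; whether a column is really an identifier or a factor still depends on what it represents.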
