11. investor_category_list: 468
12. investor_market: 134
13. investor_country_code: 111
14. investor_state_code: 80
15. investor_region: 549
16. investor_city: 1198
17. funding_round_permalink: 41790
18. funding_round_type: 13
19. funding_round_code: 15
20. funded_at: 3595
21. funded_month: 295
22. funded_quarter: 121
23. funded_year: 34
24. raised_amount_usd: 6143
If the number of unique values is low compared to the number of rows in the data set,
then that feature may indeed be treated as a categorical one (such as
funding_round_type). If the number is equal to the number of rows, it may be a unique
identifier (such as company_permalink).
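
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not how csvkit works internally: the function name classify_features, the cutoff max_ratio, the labels, and the sample rows are all assumptions made for this example.

```python
import csv
import io


def classify_features(rows, max_ratio=0.1):
    """Label each column by comparing its unique-value count to the row count.

    max_ratio is an arbitrary, hypothetical cutoff, not a csvkit setting.
    """
    rows = list(rows)
    n_rows = len(rows)
    labels = {}
    if n_rows == 0:
        return labels
    for name in rows[0]:
        # Count the distinct values seen in this column
        n_unique = len({row[name] for row in rows})
        if n_unique == n_rows:
            labels[name] = "possible identifier"
        elif n_unique / n_rows <= max_ratio:
            labels[name] = "possibly categorical"
        else:
            labels[name] = "other"
    return labels


# Made-up sample data: 40 rows with a unique id and a two-valued type column
sample = "id,round_type\n" + "".join(
    f"{i},{'seed' if i % 2 else 'venture'}\n" for i in range(40)
)
labels = classify_features(csv.DictReader(io.StringIO(sample)))
print(labels)
```

A suitable cutoff depends on the data set, so the labels are best read as hints to check by hand rather than as a decision.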
Computing Descriptive Statistics
Using csvstat
The command-line tool csvstat gives a lot of information. For each feature it shows:
• The data type in Python terminology (see Table 7-1 for a comparison between
Python and SQL data types)
• Whether it has any missing values (Nulls)
• The number of unique values
• Various descriptive statistics (i.e., maximum, minimum, sum, mean, standard
deviation, and median) for those features for which it's appropriate
We invoke csvstat as follows:
$ csvstat data/datatypes.csv
  1. a
        <type 'int'>
        Nulls: False
        Values: 2, 66, 42
  2. b
        <type 'float'>
        Nulls: True
        Values: 0.0, 3.1415
  3. c
        <type 'bool'>
        Nulls: False
        Unique values: 2
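
To make the figures that csvstat reports more concrete, the following sketch computes comparable ones for a single numeric column using only Python's standard library. It is not csvstat's implementation: summarize_column is a hypothetical helper, empty strings are treated as missing values by assumption, and the sample standard deviation used here may differ from what csvstat reports.

```python
import statistics


def summarize_column(values):
    """Return a null flag, unique count, and basic statistics for one column.

    Empty strings are treated as missing (an assumption of this sketch).
    """
    present = [v for v in values if v != ""]
    numbers = [float(v) for v in present]
    return {
        "nulls": len(present) < len(values),
        "unique": len(set(present)),
        "min": min(numbers),
        "max": max(numbers),
        "sum": sum(numbers),
        "mean": statistics.mean(numbers),
        "median": statistics.median(numbers),
        "stdev": statistics.stdev(numbers),  # sample standard deviation
    }


# Illustrative input: three numeric strings plus one missing value
summary = summarize_column(["2", "66", "42", ""])
for key, value in summary.items():
    print(f"{key}: {value}")
```

The helper assumes every non-empty value parses as a number; a text column would need the numeric statistics skipped.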