Exploring Data - Data Science at the Command Line

11. investor_category_list: 468
12. investor_market: 134
13. investor_country_code: 111
14. investor_state_code: 80
15. investor_region: 549
16. investor_city: 1198
17. funding_round_permalink: 41790
18. funding_round_type: 13
19. funding_round_code: 15
20. funded_at: 3595
21. funded_month: 295
22. funded_quarter: 121
23. funded_year: 34
24. raised_amount_usd: 6143

If the number of unique values is low compared to the number of rows in the data set, then that feature may indeed be treated as a categorical one (such as funding_round_type). If the number is equal to the number of rows, it may be a unique identifier (such as company_permalink).
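The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not part of csvkit: the function name classify_columns, the max_categorical_ratio cutoff, and the six-row sample are all made up for the example, and what counts as a "low" ratio is a judgment call that depends on the data set.

```python
import csv
import io

# Hypothetical sample; the column names echo the data set, the values are invented.
SAMPLE = """company_permalink,funding_round_type,raised_amount_usd
/company/a,venture,1000
/company/b,angel,1000
/company/c,venture,2000
/company/d,venture,3000
/company/e,angel,4000
/company/f,venture,4000
"""


def classify_columns(csv_text, max_categorical_ratio=0.05):
    """Label each column by comparing its unique-value count to the row count.

    max_categorical_ratio is an assumed cutoff, not a standard value.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    n_rows = len(rows)
    labels = {}
    if n_rows == 0:
        return labels
    for name in reader.fieldnames:
        n_unique = len({row[name] for row in rows})
        if n_unique == n_rows:
            # Every row has a different value: candidate unique identifier.
            labels[name] = "possible identifier"
        elif n_unique / n_rows <= max_categorical_ratio:
            # Few distinct values relative to the row count: candidate categorical.
            labels[name] = "possibly categorical"
        else:
            labels[name] = "other"
    return labels


if __name__ == "__main__":
    # The sample is tiny, so a generous cutoff is used here.
    for column, label in classify_columns(SAMPLE, max_categorical_ratio=0.5).items():
        print(f"{column}: {label}")
```

On a real data set the counts would come from the whole file, and a much smaller cutoff than the 0.5 used for this toy sample would likely be appropriate.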

Computing Descriptive Statistics

Using csvstat

The command-line tool csvstat gives a lot of information. For each feature it shows:

• The data type in Python terminology (see Table 7-1 for a comparison between Python and SQL data types)

• Whether it has any missing values (Nulls)

• The number of unique values

• Various descriptive statistics (i.e., maximum, minimum, sum, mean, standard deviation, and median) for those features for which it's appropriate
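To make the items in this list concrete, here is a rough Python sketch that computes similar per-column summaries using only the standard library. It is not how csvstat is implemented; the names describe and to_number and the three-row sample are assumptions made for illustration, empty cells are treated as missing, and the standard deviation is the sample version from the statistics module, which may differ from what csvstat reports.

```python
import csv
import io
import statistics

# Made-up sample, loosely modelled on a small file with mixed column types.
SAMPLE = """a,b,c
2,0.0,True
42,3.1415,False
66,,True
"""


def to_number(text):
    """Return text as a float, or None if it does not parse as a number."""
    try:
        return float(text)
    except ValueError:
        return None


def describe(csv_text):
    """Summarize each column: missing values, unique count, and basic statistics."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {}
    for name in reader.fieldnames or []:
        cells = [row[name] for row in rows]
        present = [cell for cell in cells if cell != ""]  # empty string = missing
        info = {
            "nulls": len(present) < len(cells),
            "unique": len(set(present)),
        }
        numbers = [to_number(cell) for cell in present]
        # Only compute statistics when every non-missing cell is numeric.
        if present and all(number is not None for number in numbers):
            info.update(
                min=min(numbers),
                max=max(numbers),
                sum=sum(numbers),
                mean=statistics.mean(numbers),
                median=statistics.median(numbers),
            )
            if len(numbers) > 1:
                info["stdev"] = statistics.stdev(numbers)  # sample standard deviation
        summary[name] = info
    return summary


if __name__ == "__main__":
    for column, info in describe(SAMPLE).items():
        print(column, info)
```

Skipping the statistics for non-numeric columns mirrors the "for which it's appropriate" qualifier: a minimum or mean is only reported when every non-missing value parses as a number.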

We invoke csvstat as follows:

$ csvstat data/datatypes.csv
  1. a
        Nulls: False
        Values: 2, 66, 42
  2. b
        Nulls: True
        Values: 0.0, 3.1415
  3. c
        Nulls: False
        Unique values: 2
